Abstract

The scan statistic is by far the most popular method for anomaly detection, with applications
in syndromic surveillance, signal and image processing, and target detection based on sensor
networks, among others. The use of the scan statistic in such settings yields a
hypothesis testing procedure, where the null hypothesis corresponds to the absence of anomalous 
behavior. If the null distribution is known, then calibration of a scan-based test is relatively 
easy, as it can be done by Monte Carlo simulation. When the null distribution is unknown, 
it is less straightforward. We investigate two procedures. The first one is a calibration by 
permutation and the other is a rank-based scan test, which is distribution-free and less sensitive 
to outliers. Furthermore, the rank scan test requires only a one-time calibration for a given
data size, making it computationally much more appealing. In both cases, we quantify the
performance loss with respect to an oracle scan test that knows the null distribution. We show 
that using one of these calibration procedures results in only a very small loss of power in 
the context of a natural exponential family. This includes the classical normal location model, 
popular in signal processing, and the Poisson model, popular in syndromic surveillance. We 
perform numerical experiments on simulated data further supporting our theory and also on a 
real dataset from genomics. 


1 Introduction 

Signal detection (and localization) is important in a large variety of applications, encompassing 
any situation where the goal is to discover patterns or detect/locate anomalies. Our focus is on the 
detection of anomalous behavior which is endowed with some structure. For instance, one might 
have data consisting of the physical location of a sensor and the corresponding measurement, 
and would like to determine if there is a spatial region where measurements are unusually high 
(Balakrishnan and Koutras, 2002). A standard way to tackle this problem is the use of a scan 
statistic which essentially inspects all (or at least a large number of) possible anomalous patterns. 
It usually corresponds to a form of generalized likelihood ratio test (Kulldorff, 1997). Cheung
et al. (2013) used the scan statistic to detect small geographic areas with large suicide rates,
and Guerriero et al. (2009) used it for target detection using distributed sensors
in a two-dimensional region. Although this approach might be computationally challenging, there
are a number of situations where it is possible to compute the scan statistic in nearly linear time
(Arias-Castro et al., 2005; Neill, 2012; Neill and Moore, 2004; Walther, 2010).

For the purpose of illustration, consider the following prototypical example¹: suppose we have
event data over a certain time period and want to detect if there is a time interval with an unusually
high concentration of events. To make things more concrete and move towards the setting we
consider in this paper, assume one can model these event data as a realization of a Poisson process
and bin the data, so that we observe a sequence of Poisson random variables. The scan statistic
in this particular case combines sums of these values over (discrete) intervals of different sizes and
locations, together with some normalization (see (2) further down). In this scenario we want to
perform a hypothesis test, where the null hypothesis is that no anomaly is present (a homogeneous
Poisson process) versus the alternative where some intervals have an elevated rate of events (an
inhomogeneous process). If the (constant) rate is known under the null, then the null distribution is
completely specified and the test can be calibrated either analytically or by Monte Carlo simulation.
But what if the null event rate is unknown? What are possible ways to properly calibrate the test?
What is the price to pay in terms of power?

¹ In fact, this setting might have been the original motivation for the work on the scan statistic (Wallenstein, 2009).

One can regard the scan statistic as a comparison between observations in one interval to those 
outside the interval. This point of view leads naturally to a two-sample problem for each interval, 
which is then followed by some form of multiple testing since we scan many intervals. Thus drawing 
from the classical literature on the two-sample problem, two approaches can be considered: 

• Calibration by permutation. This amounts to using the permutation distribution of the scan 
statistic for inference (detection/estimation). 

• Scanning the ranks. This amounts to replacing each observation with its rank before scanning. 
Calibration of such a test can be done by Monte Carlo simulation before the observation of 
data, as long as the size of the data is known. 

The perspective offered by the two-sample testing framework makes these two procedures very 
natural. The permutation scan has been suggested in a number of papers and applied in a number 
of ways in different contexts. It is a standard approach in neuroimaging (Nichols and Holmes, 2002) 
and is suggested in syndromic surveillance (Huang et al., 2007; Kulldorff et al., 2005, 2009). It was
suggested by Walther (2010) in the context of a sensor network with binary output and by Flenner 
and Hewer (2011) in the context of detecting a change in a sequence of images. 

Surprisingly enough, the method based on ranks appears to be relatively new in the present 
context. It was specifically (and simultaneously) proposed as a standalone procedure by Jung and
Cho (2015)², where the authors compute the scan statistic on ranks instead of the data itself.
Nevertheless, rank-based methodologies have been used earlier in similar settings, but with a different
purpose in mind. For instance, the use of ranks in the context of the scan statistic also appears in
(McFowland et al., 2013) through the computation of empirical P-values. It is important to note
that the use of ranks in the last reference is of a rather different nature than that we propose in 
our work, and that the emphasis in that paper is on the ability to efficiently compute/approximate 
scan statistics, while in our work the emphasis is on the calibration of scan tests when the null 
distribution is not known. 

Although less popular, as in the two-sample testing setting, a procedure based on ranks offers
some significant advantages over calibration by permutation: (i) it is more robust to outliers; and
(ii) its calibration can be done by Monte Carlo simulation and requires only the knowledge of the
sample size³. Point (ii) is rather pertinent, as computationally this is a huge advantage over
calibration by permutation. Furthermore, this property is rather advantageous if one desires to apply
the test repeatedly on several datasets of the same size. Compare with a calibration by permutation:
typically, several hundred permutations are sampled at random and, for each one of them, the scan
statistic is computed, and all this is done each time the test is applied.

² This article was made public after our paper was posted on arXiv.org. To the best of our knowledge, this
other publication became publicly available on October 20, 2015 (doi: 10.1186/s12942-015-0024-6), a couple of
months after ours appeared online on August 12, 2015.

³ The latter explains why, in two-sample testing, methods based on ranks were feasible decades before methods
based on permutations, which typically require access to a computer.

In this paper we study the performance of both the permutation and rank scan methods, 
providing strong asymptotic guarantees as well as insights on their finite-sample performance
in some numerical experiments. In the context of a natural exponential family — which includes the 
classical normal location model and the Poisson example above — we find that the permutation 
scan test and the rank scan test come very close to performing as well as the oracle scan test, 
which we define as the scan test calibrated by Monte Carlo with (clairvoyant) knowledge of the 
null distribution. We perform numerical experiments on simulated data which confirm our theory, 
and also some experiments using a real dataset from genomics. 

As specified below, we focus on a “static” setting, where the length of the signal being monitored 
is fixed a priori. Adding time is typically done by adding one ‘dimension’ to the framework, as 
done for example in (Kulldorff et al., 2005).

1.1 General setting 

A typical framework for static anomaly detection (which includes detection in digital signals and
images, sensor networks, biological data, and more) may be described in general terms as follows.
We observe a set of independent random variables, denoted $(X_v : v \in \mathcal{V})$, where $\mathcal{V}$ is a finite index set
of size $N$. This is a snapshot of the state of the environment, where each element of $\mathcal{V}$ corresponds
to an element of the environment (e.g., these correspond to nodes of a network, pixels in an image,
genes, etc.). In this work we take a hypothesis testing point of view. Under the null hypothesis,
corresponding to the nominal state when no anomalies are present, these random variables are
Independent and Identically Distributed (IID) with distribution $F_0$. Under the alternative, some
of these random variables will have a different distribution. Formally, let $\mathcal{S} \subset 2^{\mathcal{V}}$ denote a class of
possibly anomalous subsets, corresponding to the anomalous patterns we expect to encounter (this
would be a class of intervals in the example that we used earlier). Under the alternative hypothesis
there is a subset $S \in \mathcal{S}$ such that, for each $v \in S$, $X_v \sim F_v$ for some distribution $F_v \neq F_0$, and
independent of $(X_v : v \in \mathcal{V} \setminus S)$, which are still IID with distribution $F_0$. In a number of important
applications the variables are real-valued and the anomalous variables take larger-than-usual values,
which can be formalized by the assumption that each $F_v$ stochastically dominates⁴ $F_0$. We take this
to be the case throughout most of the paper. While the standard scan test is calibrated by Monte
Carlo by repeated sampling from the null distribution $F_0$, in contrast, the procedures we study here
(the permutation scan test and the rank scan test) are calibrated without any knowledge of $F_0$
and $F_v$.

1.2 Exponential models 

Although some of our results will be presented in the general setting above, it is useful to consider 
an important special case. This serves as a benchmark we can use to compare the performance of 
the proposed procedures against that of the optimal tests. Doing so is classical in the literature 
on nonparametric tests (Hettmansperger, 1984), where such a test is compared with the likelihood 
ratio test in some parametric model (often a location model or a scale model). 

In this paper we consider a generic one-parameter exponential model in natural form. Let $F_0$
be a probability distribution on the real line with all moments finite. This distribution can be
either continuous (i.e., diffuse), discrete (i.e., with discrete support) or a mixture of both. In the
exponential model there is a parameter $\theta_v$ associated with each $v \in \mathcal{V}$, and the distribution $F_v = F_{\theta_v}$ is
defined through its density $f_{\theta_v}$ with respect to $F_0$: for $\theta \in [0, \theta_*)$, define $f_\theta(x) = \exp(\theta x - \log \varphi_0(\theta))$,
where $\varphi_0(\theta) = \int e^{\theta x} \, dF_0(x)$ and $\theta_* = \sup\{\theta > 0 : \varphi_0(\theta) < \infty\}$, assumed to be strictly positive (and
possibly infinite). In other words, $f_{\theta_v}$ denotes the Radon-Nikodym derivative of $F_{\theta_v}$ with respect
to $F_0$. Since a natural exponential family has the monotone likelihood ratio property⁵, it follows
that $F_\theta$ is stochastically increasing in $\theta$ (Lehmann and Romano, 2005, Lem 3.4.2). In particular,
we do have $F_\theta \succeq F_0$ for all $\theta > 0$. Important special cases of such an exponential model include
the normal location model, with $F_\theta$ corresponding to $\mathcal{N}(\theta, 1)$, standard in many signal and
image processing applications; the Poisson model, with $F_\theta$ corresponding to a Poisson distribution,
popular in syndromic surveillance (Kulldorff et al., 2005); and the Bernoulli model (Walther,
2010), with $F_\theta$ corresponding to a Bernoulli distribution.

⁴ For two distribution functions on the real line, $F$ and $G$, we say that $G$ stochastically dominates $F$ if $G(t) \le F(t)$
for all $t \in \mathbb{R}$. We denote this by $G \succeq F$.

Note that in the formulation above the alternative hypothesis is composite. Tackling this
problem using a generalized likelihood ratio approach is popular in practice (Kulldorff, 1997) and is
often referred to as the scan test, as it works by scanning over the possible anomalous sets to
determine if there is such a set that is able to “explain” the observed data. Assuming the nonzero
$\theta_v$'s are all equal to $\theta$ under the alternative, and that all subsets in the class $\mathcal{S}$ have the same size, some
simplifications lead to considering the test that rejects for large values of the scan statistic

$$\max_{S \in \mathcal{S}} \sum_{v \in S} X_v . \tag{1}$$

When the subsets in the class $\mathcal{S}$ may have different sizes, a more reasonable approach includes a
normalization of the partial sums above, leading to the following variant of the scan statistic

$$\max_{S \in \mathcal{S}} \frac{1}{\sqrt{|S|}} \sum_{v \in S} \big( X_v - \mathrm{E}_0(X_v) \big) . \tag{2}$$

($\mathrm{E}_\theta$ denotes the expectation with respect to $F_\theta$, and for a discrete set $S$, $|S|$ denotes its cardinality.)
As argued in (Arias-Castro and Grimmett, 2013), this test is in a certain sense asymptotically
equivalent to the generalized likelihood ratio test.
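
To make the normalized statistic (2) concrete, here is a minimal Python sketch for the class of all discrete intervals. It assumes NumPy is available; the function name `scan_statistic` and the argument `mu0` (standing for the null mean $\mathrm{E}_0(X_v)$, taken as known) are illustrative choices and do not come from the paper. It is a brute-force computation over all interval lengths based on cumulative sums, not one of the nearly linear-time methods cited earlier.

```python
import numpy as np

def scan_statistic(x, mu0=0.0):
    """Normalized scan statistic over all intervals {a, ..., b} (brute force).

    mu0 is the (assumed known) null mean; this name is hypothetical.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # prefix[k] = sum of the first k centered observations
    prefix = np.concatenate(([0.0], np.cumsum(x - mu0)))
    best = -np.inf
    for length in range(1, n + 1):
        # sums over every interval of the given length
        sums = prefix[length:] - prefix[:-length]
        best = max(best, sums.max() / np.sqrt(length))
    return float(best)
```

For instance, on the sequence `[0, 0, 3, 3, 0]` with `mu0 = 0` the maximum is attained on the two middle entries, giving $6/\sqrt{2}$.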


1.3 Calibration by permutation 

Suppose we are considering a test that rejects the null for large values of a test statistic $T(\mathbf{X})$,
where $\mathbf{X} = (X_v, v \in \mathcal{V})$. Let $\mathbf{x} = (x_v, v \in \mathcal{V})$ denote the observed value of $\mathbf{X}$. If we were to know the
null distribution $F_0$, we would return the P-value $\mathbb{P}_0(T(\mathbf{X}) \ge T(\mathbf{x}))$, where the probability is computed
under $F_0$. In practice, even with the knowledge of $F_0$, computing the exact P-value might be difficult,
but one can estimate it to arbitrary accuracy by Monte Carlo simulation.

Ignoring computational constraints for the moment, calibration by permutation amounts to
computing $T(\mathbf{x}_\pi)$ for all $\pi \in \mathcal{V}!$, where $\mathcal{V}!$ denotes the set of all permutations of $\mathcal{V}$ and
$\mathbf{x}_\pi = (x_{\pi(v)}, v \in \mathcal{V})$ is the permuted data. We then return the P-value

$$\frac{1}{|\mathcal{V}!|} \, \big| \{ \pi \in \mathcal{V}! : T(\mathbf{x}_\pi) \ge T(\mathbf{x}) \} \big| ,$$

and the rejection decision is based on this value. Let $M = |\{ T(\mathbf{x}_\pi) : \pi \in \mathcal{V}! \}|$. If there are no
multiplicities, meaning $M = |\mathcal{V}!|$, it can be shown such tests are exact and that under the null the
P-value has a (discrete) uniform distribution on $\{1/M, 2/M, \dots, 1\}$. Otherwise the test will be slightly
conservative (Lehmann and Romano, 2005). In practice, the number of permutations is very large
(as $|\mathcal{V}!| = |\mathcal{V}|!$) and the P-value is estimated by simulation (by uniform sampling of permutations).

⁵ A family of densities $\{f_\theta : \theta \in \Theta\}$, where $\Theta \subset \mathbb{R}$, has the monotone likelihood ratio property if $f_{\theta'}(x)/f_\theta(x)$ is
increasing in $x$ when $\theta' > \theta$.

In our setting, $T$ above will be a form of scan statistic, similar to the one in (2), which
maximizes a standardized sum of data entries over a class $\mathcal{S}$ of possible anomalous sets. When
calibrating by permutation we are comparing the value $T(\mathbf{x})$ of this statistic on the original data
$\mathbf{x}$ with the corresponding value $T(\mathbf{x}_\pi)$ on permuted data $\mathbf{x}_\pi$. This is only sensible if the class $\mathcal{S}$
has some structure, and in particular it cannot be invariant under permutations. In this paper we
consider what is perhaps the simplest such class, which is the class of intervals

$$\mathcal{V} = \{1, \dots, N\} \quad \text{and} \quad \mathcal{S} = \big\{ \{a, \dots, b\} : 1 \le a \le b \le N \big\} .$$

In the next section we elaborate on other possible structural constraints, and the theoretical approach
we develop can be used to study the calibration by permutation in those settings as well.

Assuming T has been chosen, we define the oracle scan test as the scan test calibrated with full 
knowledge of the null distribution by Monte Carlo simulation, and the permutation scan test as 
the scan test calibrated by permutation as explained above. 
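
The permutation calibration described in this section can be sketched as follows, for the class of intervals and with permutations sampled uniformly at random. This is a minimal illustration under stated assumptions rather than the exact procedure analyzed in the paper: the names `interval_scan` and `permutation_pvalue` are hypothetical, the data are centered at their sample mean because the null mean is unknown, and the estimate $(1 + \text{count})/(1 + B)$ over $B$ sampled permutations is one common convention that keeps the estimated P-value strictly positive.

```python
import numpy as np

def interval_scan(x):
    """Normalized interval scan of x, centered at its sample mean (an assumption)."""
    x = np.asarray(x, dtype=float)
    prefix = np.concatenate(([0.0], np.cumsum(x - x.mean())))
    best = -np.inf
    for length in range(1, x.size + 1):
        sums = prefix[length:] - prefix[:-length]
        best = max(best, sums.max() / np.sqrt(length))
    return float(best)

def permutation_pvalue(x, n_perm=500, seed=None):
    """Monte Carlo estimate of the permutation P-value of the interval scan."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    observed = interval_scan(x)
    # count sampled permutations whose scan is at least the observed one
    exceed = sum(interval_scan(rng.permutation(x)) >= observed
                 for _ in range(n_perm))
    return (1 + exceed) / (1 + n_perm)
```

Each call recomputes the scan on every sampled permutation, which is exactly the per-dataset cost that the rank-based approach of the next section avoids.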

Contribution 1: We characterize the performance of the permutation scan test in the context of 
the exponential family, concluding that it has as much asymptotic power as the oracle scan test 
(Theorem 1).

We note that permutation tests are known to perform this well in classical two-sample testing 
(Lehmann and Romano, 2005). However, in the context of the scan test, we are only aware of one 
other paper, that of Walther (2010), that develops theory for the permutation scan test. This is 
done in the context of binary data (a Bernoulli model). Our analysis extends the theory to any 
natural exponential model as described in Section 1.2 (which also includes the binary case). This 
requires a different set of tools. 

1.4 Scanning the ranks 

As explained earlier, when calibrating by permutation the computation of the scan statistic T must 
be done for a large enough number of permutations of the original dataset. Even though this is done 
for only a relatively small number of permutations, that number is often chosen in the hundreds, 
if not thousands, meaning that the procedure requires the computation of that many scans. Even 
if the computation (in fact, approximation) of the scan statistic is done in linear time this can be 
rather time consuming. Furthermore, for a new instantiation of the data the whole procedure must
be undertaken anew. The computational burden of doing so may be prohibitive in some practical 
situations, for instance, when monitoring a sensor network in real-time. 

To mitigate those drawbacks we propose instead a rank-based approach, which avoids the 
expensive calibration by permutation. The procedure amounts to simply replacing the observations 
with their ranks⁶ before scanning, so that we end up scanning the ranks instead of the original
values. If ties in the ranks are broken randomly, the resulting test statistic is distribution-free and
can therefore be calibrated by Monte Carlo simulation requiring only the knowledge of the data
size (which is $N = |\mathcal{V}|$ in our context). In terms of computational complexity this procedure is as
complex as the implementation of a scan test when the null distribution is fully known so there is 
no computational disadvantage in using ranks. In fact faster implementations might be possible by 
taking advantage of the discrete nature of the ranks and avoiding floating-point algebra, but these 
algorithmic considerations are beyond the scope of this paper. 
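
A minimal sketch of the rank scan and its one-time calibration is given below, again for the class of intervals. The function names are hypothetical, and centering the ranks at $(N+1)/2$ (their mean under any ordering) is an assumption of this sketch. The critical value depends only on $N$, the level and the simulation budget, so it can be computed before any data are observed and reused for every dataset of the same size.

```python
import numpy as np

def rank_scan(ranks):
    """Normalized interval scan of ranks 1..N, centered at (N + 1) / 2."""
    r = np.asarray(ranks, dtype=float)
    n = r.size
    prefix = np.concatenate(([0.0], np.cumsum(r - (n + 1) / 2)))
    best = -np.inf
    for length in range(1, n + 1):
        sums = prefix[length:] - prefix[:-length]
        best = max(best, sums.max() / np.sqrt(length))
    return float(best)

def random_tie_ranks(x, rng):
    """Ranks in increasing order of magnitude, ties broken uniformly at random."""
    x = np.asarray(x)
    # lexsort uses the last key as primary: sort by x, then by a random key
    order = np.lexsort((rng.random(x.size), x))
    ranks = np.empty(x.size, dtype=int)
    ranks[order] = np.arange(1, x.size + 1)
    return ranks

def rank_scan_critical_value(n, level=0.05, n_sim=1000, seed=None):
    """Monte Carlo critical value; under the null the ranks are a uniform permutation."""
    rng = np.random.default_rng(seed)
    null_stats = [rank_scan(rng.permutation(n) + 1) for _ in range(n_sim)]
    return float(np.quantile(null_stats, 1 - level))
```

In use, one would compute `rank_scan_critical_value(N)` once and then reject whenever `rank_scan(random_tie_ranks(x, rng))` exceeds it.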

⁶ Throughout, the observations are ranked in increasing order of magnitude.

Contribution 2: We establish the performance of the rank scan test (Theorem 2 and Proposition 3).
In the context of the exponential family we show that it has nearly as much asymptotic
power as the oracle scan test (Proposition 2).

This result is remarkable in the sense that the scan test can be completely calibrated before 
any data has been observed, and yet attain essentially the same power as the optimal test with 
full knowledge of the statistical model. Such a procedure is very natural (albeit distinct) given 
the classical literature on nonparametric tests (Hettmansperger, 1984), and rank tests such as 
Wilcoxon’s are known to perform this well in classical two-sample testing (Hettmansperger, 1984; 
Lehmann and Romano, 2005). 

Our results allow us to precisely quantify how much (asymptotic) power is lost when using the 
rank scan test versus the oracle scan test. For example, in the normal means model the rank scan 
test requires a signal magnitude 1.023 times larger than the regular scan test to be asymptotically 
powerful against anomalous sets that are not too small. 

1.5 Structured anomalies 

Naturally, the intrinsic difficulty of the detection task depends not only on the data distribution, 
but also on the complexity of the class of anomalous sets S. Furthermore, for the permutation or 
rank-based approaches to be sensible this class must have some structure and not be invariant under 
permutations, as seen above. In several scenarios structural assumptions on such classes arise very 
naturally. For instance, grid-like networks are an important special case, arising in applications 
such as signal and image processing (where the signals are typically regularly sampled) and sensor 
networks deployed for the monitoring of some geographical area, for example. This situation is 
considered in great generality and from different perspectives in (Arias-Castro et al., 2011, 2005;
Cai and Yuan, 2014; Desolneux et al., 2003; Hall and Jin, 2010; Perone Pacifico et al., 2004; Walther,
2010). Also, the distribution of the corresponding scan statistic (2) and variants has been studied 
in a number of places (Boutsikas and Koutras, 2006; Jiang, 2002; Kabluchko, 2011; Sharpnack and 
Arias-Castro, 2014; Siegmund and Venkatraman, 1995). 

The simplest and most emblematic setting is that of detecting an interval in a one-dimensional
regularly sampled signal, as highlighted above. However, the principles underlying the
detection of intervals can be used for the detection of much more general anomaly classes. As shown
in (Arias-Castro et al., 2011), similar results apply to a general (nonparametric) class $\mathcal{S}$ of blob-like
(‘thick’) sets $S$ when $\mathcal{V}$ is a grid-like set of arbitrary finite dimension, although the scanning is done
over an appropriate approximating net for $\mathcal{S}$ (instead of the entire class $\mathcal{S}$). Furthermore, these
results generalize to one-parameter exponential models, beyond the commonly assumed normal
location model, as long as the sets $S \in \mathcal{S}$ are sufficiently large (poly-logarithmic in $N$). Other papers
that develop theory for different environments include (Addario-Berry et al., 2010; Arias-Castro
et al., 2008; Sharpnack and Singh, 2010; Sharpnack et al., 2013; Zhao and Saligrama, 2009).
Variants of this detection problem have been suggested, and the applied literature is quite extensive.
We refer the reader to (Arias-Castro et al., 2011) and references therein.

Since the main motivation of our work is to develop methods and theory for the scenario when 
the distributions are unknown/unspecified we focus exclusively on the detection of intervals, for 
the sake of clarity and simplicity. Nevertheless our techniques and results apply naturally to more 
general anomaly classes (e.g., rectangles in two or more dimensions, or even blob-like subsets). The
keys to these generalizations are proper concentration inequalities for sampling without
replacement, namely Lemmas 2 and 4, and a geometric characterization of the anomaly class in terms of
an approximating net akin to Lemma 1. The latter characterization is heavily dependent on the
class of anomalous sets under consideration, as described in the preceding paragraph. Furthermore,
although it is possible to study a version of the test that scans over all possible anomalous sets,
we choose to study a scan test restricted to an approximating net because of the following
advantages: the analysis is simpler as it does not require the use of chaining to achieve tight constants;
it is applicable in more general settings, in particular when the class $\mathcal{S}$ is nonparametric; it is
computationally advantageous as it gives rise to fast implementations.

1.6 Content and notation 

The rest of the paper is organized as follows. In Section 2 we consider the case when the null 
distribution is known. This section is expository, introducing the reader to the basic proof
techniques that are used, for example, in (Arias-Castro et al., 2011), to establish the performance of
the scan statistic when calibrated with full knowledge of the null distribution — the oracle scan 
test, as we called it here. To keep the exposition simple, and to avoid repeating the substantially 
more complex arguments detailed in that paper and others, we focus on the problem of detecting 
an interval in a one-dimensional lattice. This allows us to set the foundation and discover what the 
performance bounds for the scan test in this case rely on. In Section 3 we consider the same setting 
and instead calibrate the scan statistic by permutation. In Section 4 we consider the same setting 
and instead scan the ranks. In both cases, our analysis relies on concentration inequalities for sums 
of random variables obtained from sampling without replacement from a finite set of reals, already 
established in the seminal paper of Hoeffding (1963). In Section 5 we perform some simulations to 
numerically quantify how much is lost in finite samples when calibrating by permutation or when 
using ranks. We also compare our methodology with the method of Cai et al. (2012), on simulated 
data, and also on a real dataset from genomics. Section 6 is a brief discussion. Except for the 
expository derivations in Section 2, the technical arguments are gathered in Section 7. 

2 When the null distribution is known 

This section is meant to introduce the reader to the techniques underlying the performance bounds 
developed in (Arias-Castro et al., 2011, 2005) for the scan statistic (and variants) when the null
distribution is known. These provide a stepping stone for our results regarding permutation
and rank scan tests. We detail the setting of detecting an interval of unknown length in a
one-dimensional lattice. Therefore, as in Section 1.3, consider the setting where

$$\mathcal{V} = \{1, \dots, N\} \quad \text{and} \quad \mathcal{S} = \big\{ \{a, \dots, b\} : 1 \le a \le b \le N \big\} .$$

We begin by considering the normal model, where the $X_v \sim \mathcal{N}(\theta_v, 1)$ are independent, and explain later
on how to generalize the arguments to an arbitrary exponential model as described in Section 1.2.
We are interested in testing

$$H_0 : \theta_v = 0, \ \forall v \in \mathcal{V} \quad \text{versus} \quad H_1 : \exists S \in \mathcal{S} : \ \min_{v \in S} \theta_v \ge \tau \sqrt{2 \log(N) / |S|} , \tag{3}$$

where $\tau > 0$ is fixed. We consider this problem from a minimax perspective. It is shown in
(Arias-Castro et al., 2005) that, if $\tau < 1$, then any test with level $\alpha$ has power at most $\beta(\alpha, N)$, with
$\beta(\alpha, N) \to \alpha$ as $N \to \infty$. In other words, in the large-sample limit, no test can do better than random
guessing, that is, the test that rejects with probability $\alpha$ regardless of the data. On the other hand, if
$\tau > 1$, then for any level $\alpha > 0$ there exists a test with level $\alpha$ and power $\beta(\alpha, N) \to 1$ as $N \to \infty$.
In particular, such a test can be constructed using a form of scanning over an approximating net,
as explained in the rest of this section.
Step 1: Construction of an approximating net. Instead of scanning over $\mathcal{S}$ we will scan over a
subclass of intervals $\mathcal{S}_b$, where $b$ is a positive integer to be specified later on. This brings both
computational and analytical advantages over scanning all sets in $\mathcal{S}$, as discussed in Section 1.5. Such
a subclass must satisfy two important properties, namely have cardinality significantly smaller than
$\mathcal{S}$, and be such that any element $S \in \mathcal{S}$ can be “well approximated” by an element $S^* \in \mathcal{S}_b$. By
well approximated we mean that $\rho(S, S^*) \approx 1$, where

$$\rho(S, S^*) := \frac{|S \cap S^*|}{\sqrt{|S| \, |S^*|}}$$

is a measure of similarity of two sets. We use an approximating net similar to that of (Arias-Castro
et al., 2005); see (Sharpnack and Arias-Castro, 2014) for an alternative construction.

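To simplify the presentation assume $N$ is a power of 2 (namely $N = 2^q$ for some integer $q$). Let
$\mathcal{D}_j$ denote the class of dyadic intervals at scale $j$, meaning of the form $S = [1 + k 2^j, (k+1) 2^j] \subset \mathcal{V}$
with $j$ and $k$ nonnegative integers. Let $\mathcal{D}_{j,0}$ denote the class of intervals of the form $S \cup S'$ with
$S, S' \in \mathcal{D}_j$. Note that $\mathcal{D}_j \subset \mathcal{D}_{j,0}$. Then, for $1 \le k < b$, let $\mathcal{D}_{j,k}$ be the class of intervals of $\mathcal{V}$ of
the form $S_{\rm left} \cup S \cup S_{\rm right}$, where $S \in \mathcal{D}_{j,k-1}$ while $S_{\rm left}$ (resp. $S_{\rm right}$) is adjacent to $S$ on the left
(resp. right) and is either empty or in $\mathcal{D}_{j-k}$. Note that $\mathcal{D}_{j,k-1} \subset \mathcal{D}_{j,k}$ by construction. In the last
step, $\mathcal{D}_{j,b}$ is of the same form as before, only the appended intervals $S_{\rm left}$ and $S_{\rm right}$ are either empty,
or in $\mathcal{D}_{j-b+1}$. Finally, define $\mathcal{S}_b = \bigcup_j \mathcal{D}_{j,b}$.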

We can prove the following result for this approximating net, using similar arguments to those 
of Arias-Castro et al. (2005). 

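Lemma 1. The subclass $\mathcal{S}_b \subset \mathcal{S}$ has cardinality at most $N 4^{b+1}$ and is such that for any element
$S \in \mathcal{S}$ there is an element $S^* \in \mathcal{S}_b$ satisfying $S \subset S^*$ and $\rho(S, S^*) \ge (1 + 2^{-b+2})^{-1/2}$.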

Remark 1. It is easy to see that the subclass $\mathcal{S}_b$ can be scanned in $O(N b 4^b)$ operations; this
is implicit in (Arias-Castro et al., 2005). Indeed, we start by observing that scanning all dyadic
intervals can be done in $O(N)$ operations by recursion, starting from the smallest intervals and
moving up (in scale) to larger intervals. We then conclude by realizing that each interval in $\mathcal{S}_b$ is
the union of at most $2b + 2$ dyadic intervals.
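
The recursion over dyadic intervals mentioned in this remark can be sketched as follows. This is a minimal illustration, assuming NumPy and a signal whose length is a power of 2; the function name `dyadic_sums` is hypothetical, and indices are 0-based here, so that entry `k` at scale `j` is the sum of `x[k * 2**j : (k + 1) * 2**j]`. Each scale is obtained from the previous one by adding neighbouring pairs, so the total work is $N + N/2 + \dots + 1 < 2N$ additions.

```python
import numpy as np

def dyadic_sums(x):
    """Return a list whose j-th entry holds the sums over dyadic intervals of length 2**j."""
    level = np.asarray(x, dtype=float)
    n = level.size
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    sums = [level]
    while level.size > 1:
        # merge each pair of sibling intervals into their parent at the next scale
        level = level[0::2] + level[1::2]
        sums.append(level)
    return sums
```

The sum over any interval in the approximating net can then be assembled from a small number of these precomputed values.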


Step 2: Definition of the scan test. We consider a test based on scanning only the intervals in $\mathcal{S}_b$.
This test rejects the null if

$$\max_{S \in \mathcal{S}_b} Y_S \ge \sqrt{2 (1 + \eta) \log N} \quad \text{with} \quad Y_S := \frac{1}{\sqrt{|S|}} \sum_{v \in S} X_v , \tag{4}$$

where $\eta > 0$ satisfies $\eta \to 0$ and $\eta \log(N) \to \infty$. (The reason for these conditions will become clear
shortly.)

Step 3: Under the null hypothesis. By the union bound, we have

$$\mathbb{P}_0\Big( \max_{S \in \mathcal{S}_b} Y_S \ge \sqrt{2 (1 + \eta) \log N} \Big) \le \sum_{S \in \mathcal{S}_b} \mathbb{P}_0\Big( Y_S \ge \sqrt{2 (1 + \eta) \log N} \Big) \le |\mathcal{S}_b| \, \bar\Phi\Big( \sqrt{2 (1 + \eta) \log N} \Big) ,$$

where $\Phi$ denotes the standard normal distribution function and $\bar\Phi = 1 - \Phi$ denotes the corresponding
survival function. We have the well-known bound on Mills' ratio:

$$\bar\Phi(x) \le e^{-x^2/2} , \quad \forall x \ge 0 . \tag{5}$$

Therefore we get

$$\mathbb{P}_0\Big( \max_{S \in \mathcal{S}_b} Y_S \ge \sqrt{2 (1 + \eta) \log N} \Big) \le N 4^{b+1} N^{-(1+\eta)} = N^{-\eta} 4^{b+1} .$$

We choose $b = \frac{1}{2} \eta \log(N) / \log(4)$. With our assumption that $\eta \log N \to \infty$, this makes the last
expression tend to zero as $N \to \infty$. (It also implies that $b \to \infty$, which we use later on.) We
conclude the test in (4) has level tending to 0 as $N \to \infty$.
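
To see why this choice of $b$ works (ignoring the rounding of $b$ to an integer), note that $b = \frac{1}{2} \eta \log(N)/\log(4)$ gives $4^b = \exp(b \log 4) = N^{\eta/2}$, so that

$$N^{-\eta} \, 4^{b+1} = 4 \, N^{-\eta} \, N^{\eta/2} = 4 \, N^{-\eta/2} = 4 \exp\big( -\tfrac{1}{2} \eta \log N \big) \to 0 ,$$

while $b = \eta \log(N) / (2 \log 4) \to \infty$, both because $\eta \log N \to \infty$.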

Step 4: Under the alternative. We now show that the power of this test tends to 1 when $\tau > 1$.
Let $S$ denote the anomalous interval. Referring to Lemma 1, there is a set $S^* \in \mathcal{S}_b$ such that
$S \subset S^*$ and $\rho(S, S^*)$ is bounded from below as stated there, so that $\rho(S, S^*) = 1 + o(1)$ since $b \to \infty$.
Furthermore $Y_{S^*}$ is normal with mean at least $\rho(S, S^*) \, \tau \sqrt{2 \log N}$ and variance 1. We thus have

$$\mathbb{P}\Big( Y_{S^*} \ge \sqrt{2 (1 + \eta) \log N} \Big) \ge \bar\Phi(\zeta) ,$$

where

$$\begin{aligned}
\zeta &:= \sqrt{2 (1 + \eta) \log N} - \rho(S, S^*) \, \tau \sqrt{2 \log N} \\
&= \sqrt{2 (1 + \eta) \log N} \, \big( 1 - (1 + o(1)) \, \tau / \sqrt{1 + \eta} \big) \\
&\sim -(\tau - 1) \sqrt{2 \log N} \to -\infty ,
\end{aligned}$$

where we used the fact that $\tau > 1$ is fixed and $\eta \to 0$. We conclude that the test in (4) has power
tending to 1 as $N \to \infty$. In conclusion, we have shown the following result.

Proposition 1 (Arias-Castro et al. (2005)). Refer to the hypothesis testing problem in (3). The
test defined in (4), with $\eta = \eta_N \to 0$, $\eta_N \log N \to \infty$ and $b = b_N = \frac{1}{2} \eta_N \log(N) / \log(4)$, has level converging to
0 as $N \to \infty$. Moreover, when $\tau > 1$ it has power converging to 1 as $N \to \infty$.

We remark that, in principle, we may choose any $b = b_N \to \infty$ such that $b_N / \log N \to 0$. From
Remark 1 the computational complexity of the resulting scan test is of order $O(N b_N 4^{b_N})$. For
example, $b_N \sim \log \log N$ is a valid choice and the resulting scan test runs in $O(N \operatorname{polylog}(N))$ time.

2.1 Generalizations 

The arguments just given for the setting of detecting an anomalous interval under a normal location 
model can be generalized to the problem of detecting other classes of subsets under other kinds of 
distributional models. We briefly explain how this is done. (Note that these generalizations can be 
combined.) 

Other classes of anomalous subsets. For a given detection problem, specified by a set of nodes
$\mathcal{V}$ and a class of subsets $\mathcal{S} \subset 2^{\mathcal{V}}$, the arguments above continue to apply if one is able to construct
an appropriate approximating net as in Lemma 1. This is done, for example, in (Arias-Castro
et al., 2011, 2005) for a wide range of settings. We note that the construction of a net is purely
geometrical and/or combinatorial.

Other exponential models To extend the result to an arbitrary (one-parameter, natural) exponential
model, we require the equivalent of the tail bound (5). While such a bound may not apply
to a particular exponential model, it does apply asymptotically to large sums of IID variables from
that model by Chernoff's bound and a Taylor development of the rate function.
Indeed, recalling the notation introduced in Section 1.2, let $\psi_0(t) = \sup_{\lambda \in [0,\theta_*)} [\lambda t - \log \varphi_0(\lambda)]$,
which is the rate function of $F_0$. By Chernoff's bound, we have

$$\mathbb{P}_0(Y_S > y) \le \exp\big(-|S|\,\psi_0(y\,|S|^{-1/2})\big). \tag{6}$$

Assuming without loss of generality that $F_0$ has zero mean and unit variance, we have

$$\psi_0(t) \ge \tfrac12 t^2 + O(t^3), \quad t \to 0. \tag{7}$$

To see this, note that $\varphi_0(\lambda)$ is infinitely many times differentiable when $\lambda \in [0,\theta_*)$, with $\varphi_0'(0) = \mathbb{E}_0(X) = 0$
and $\varphi_0''(0) = \mathbb{E}_0(X^2) = 1$. Therefore $\varphi_0(\lambda) = 1 + \tfrac12\lambda^2 + O(\lambda^3)$ as $\lambda \to 0$. For $t \in [0,\theta_*)$, we
then have

$$\psi_0(t) = \sup_{\lambda \in [0,\theta_*)} [\lambda t - \log\varphi_0(\lambda)] \ge t^2 - \log\varphi_0(t)
= t^2 - \log\big(1 + \tfrac12 t^2 + O(t^3)\big) \ge \tfrac12 t^2 + O(t^3),$$

where we use $\log(1 + x) \le x$. From this we see that our derivations for the normal model apply
essentially verbatim if, for some constant $c > 0$, $|S| \ge c(\log N)^3$ for all $S \in \mathcal{S}$. Furthermore, it can be
seen that the test in (4) is essentially optimal for exponential models, as its performance matches
the lower bounds in (Arias-Castro et al., 2011).
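
As a numerical illustration of the bound (7), the following minimal sketch approximates the rate function on a grid and compares it with the lower bound obtained from the choice $\lambda = t$ and with $t^2/2$. The Rademacher null (values $\pm 1$ with probability 1/2, so zero mean and unit variance, with moment generating function $\cosh\lambda$) is an assumption made purely for illustration, as are the function names and the grid parameters.

```python
import math

def rate_function(t, log_mgf, lam_max=5.0, grid=20000):
    """Approximate sup over lam in [0, lam_max] of lam * t - log_mgf(lam) on a grid."""
    best = 0.0  # lam = 0 always gives the value 0
    for i in range(1, grid + 1):
        lam = lam_max * i / grid
        best = max(best, lam * t - log_mgf(lam))
    return best

def log_mgf_rademacher(lam):
    """Log moment generating function of a Rademacher variable (illustrative null)."""
    return math.log(math.cosh(lam))

for t in (0.05, 0.1, 0.2, 0.4):
    psi = rate_function(t, log_mgf_rademacher)
    crude = t * t - log_mgf_rademacher(t)  # the choice lam = t used in the text
    print(f"t={t:.2f}  psi0(t)={psi:.6f}  t^2-log(phi0(t))={crude:.6f}  t^2/2={t * t / 2:.6f}")
```

For small $t$ the three columns nearly coincide, which is the content of (7): to second order the rate function behaves like that of the standard normal.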


3 Calibration by permutation 


Having described in detail how a performance bound is established for the scan test variant (4) 
for the problem of detecting an interval of unknown length, and its extensions to other detection 
problems, we now clearly see that the key to adapting this analysis to a calibration by permutation 
is a concentration of measure bound to replace (5) and (6). Since this is the same in any detection 
setting, we consider as in Section 2 the problem of detecting an interval of unknown length. This 
time, we impose a minimum and maximum length on the intervals 

$$\mathcal{S} = \big\{\{a,\dots,b\} : 1 \le a \le b \le N,\ 2^{q_l} \le b - a \le 2^{q_u}\big\}. \tag{8}$$

Indeed, when calibrating the scan test by permutation, we necessarily have to assume nontrivial
upper and lower bounds on the size of an anomalous interval. To see this consider intervals of
length one. Then the value of the scan for any permutation of the data is the same. By symmetry
the same reasoning applies for intervals of length $N - 1$.

We consider essentially the same form of the scan statistic (2) as before, but replace $\mathbb{E}_0(X_v)$
(which we do not have access to) by $\bar{x} = \frac{1}{N}\sum_{v \in V} x_v$ and scan over an approximating net. We restrict
the approximating net to match the class of intervals defined in (8) (but still call it $\mathcal{S}_b$ for simplicity).
Specifically, we only keep an element $S^* \in \mathcal{S}_b$ if there is $S \in \mathcal{S}$ for which $\rho(S,S^*)$ satisfies the lower bound of Lemma 1.
This ensures that the statements in Lemma 1 still hold, and also that $|S^*| \ge 2^{q_l}(1 + o(1))$ for all
$S^* \in \mathcal{S}_b$ when $b \to \infty$. In detail, with $x = (x_v, v \in V)$ denoting the observed data, we define


$$\mathrm{SCAN}(x) = \max_{S \in \mathcal{S}_b} \big(Y_S(x) - \bar{x}\sqrt{|S|}\big), \qquad Y_S(x) := \frac{1}{\sqrt{|S|}} \sum_{v \in S} x_v. \tag{9}$$

The test rejects the null at significance level $\alpha \in (0,1)$ when

$$\mathfrak{P}(x) := \frac{1}{|V|!}\,\big|\{\pi \in V! : \mathrm{SCAN}(x_\pi) \ge \mathrm{SCAN}(x)\}\big| \le \alpha, \tag{10}$$

where $\mathfrak{P}(x)$ is the permutation P-value.

Theorem 1. Refer to the hypothesis testing problem in (3) and assume $F_0$ has zero mean and
variance one. Consider the test that rejects the null if $\mathfrak{P}(X) \le \alpha$, where $\mathfrak{P}$ is defined in (10), with
$b = b_N \to \infty$ and $b_N/\log N \to 0$ as $N \to \infty$. This test has level at most $\alpha$. Furthermore, assume that
under the alternative the anomalous set $S$ belongs to $\mathcal{S}$ defined in (8) with $q_l - 3\log_2\log N \to +\infty$
and $q_u - \log_2 N \to -\infty$ as $N \to \infty$. This test has power converging to 1 as $N \to \infty$ when

$$\frac{1}{|S|}\sum_{v \in S} \theta_v \ge \tau\sqrt{2\log(N)/|S|}, \quad \text{with } \tau > 1 \text{ fixed},$$

provided that either $F_0$ has compact support or $\max_v \theta_v \le \bar\theta < \theta_*$ for some fixed $\bar\theta > 0$.

The headline here is that a calibration by permutation has as much asymptotic power as a 
calibration by Monte Carlo with full knowledge of the null distribution (to first-order accuracy). 
This is (qualitatively) in line with what is known in classical settings (Lehmann and Romano, 
2005). Note that this testing procedure makes no assumptions about $F_0$ or about the existence of
an underlying exponential model. 
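
In practice the permutation P-value in (10) is approximated by sampling permutations at random rather than enumerating all of them. The following is a minimal sketch under stated assumptions, not the authors' implementation: it scans all intervals of a few fixed lengths (instead of an approximating net), it assumes the statistic takes the centered and normalized form $\big(\sum_{v \in S} x_v - |S|\bar{x}\big)/\sqrt{|S|}$, and the function names, the lengths, the number of permutations and the simulated data are all hypothetical choices.

```python
import math
import random

def scan_statistic(x, lengths):
    """Max over intervals S of the given lengths of (sum_{v in S} x_v - |S| * mean(x)) / sqrt(|S|)."""
    n = len(x)
    mean = sum(x) / n
    prefix = [0.0]
    for value in x:
        prefix.append(prefix[-1] + value - mean)  # centered prefix sums
    best = -math.inf
    for k in lengths:
        for start in range(n - k + 1):
            best = max(best, (prefix[start + k] - prefix[start]) / math.sqrt(k))
    return best

def permutation_p_value(x, lengths, num_perm=200, rng=None):
    """Monte Carlo approximation of the permutation P-value."""
    rng = rng or random.Random(0)
    observed = scan_statistic(x, lengths)
    shuffled = list(x)
    exceed = 1  # counting the identity permutation keeps the test valid
    for _ in range(num_perm):
        rng.shuffle(shuffled)
        if scan_statistic(shuffled, lengths) >= observed:
            exceed += 1
    return exceed / (num_perm + 1)

rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(256)]
for v in range(100, 116):  # plant an anomalous interval of length 16
    x[v] += 2.5
p_value = permutation_p_value(x, lengths=[8, 16, 32], rng=rng)
print(p_value)
```

Every call rescans each permuted dataset, which is why this calibration costs roughly the number of sampled permutations times the cost of a single scan.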

Remark 2. The assumption that $F_0$ has zero mean and variance one is without any loss of generality,
and merely for clarity of presentation. In general, the permutation-based test is asymptotically
powerful under the alternative if there is a set $S \in \mathcal{S}$ such that

7^7 X! ^2log(iV)/|cS|, with r > 1 fixed, 

l‘^l v<iS *^0 

where $\sigma_0^2$ denotes the variance of $F_0$.

The conditions required here allow $\mathcal{S}$ to be any class of intervals of lengths between $(\log N)^{3+a}$
and $o(N)$, for any $a > 0$ fixed. This includes the most interesting cases of intervals not too short
and also not too long. In fact, for certain families of distributions removing from consideration
very small intervals is essential and cannot be avoided.

Example 1. For instance consider the Bernoulli model, where $X_v \sim \mathrm{Bernoulli}(1/2)$, for all $v \in V$
under the null, and $X_v \sim \mathrm{Bernoulli}(1)$, for all $v \in S$ when $S$ is anomalous. Even under the null
we will encounter a run of ones of length $\sim \log_2 N$ (the famous Erdős–Rényi law) with positive
probability. Therefore in this case the scan test, calibrated by Monte Carlo or permutation, is
powerless for detection of intervals of length | log 2 N. In fact, it can be shown that no test has any 
power in that case. 
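
The claim about runs of ones under the null is easy to check by simulation. The sketch below is illustrative only; the sample sizes, the number of repetitions and the names are arbitrary choices. It records the longest run of ones in $N$ fair coin flips and prints its average next to $\log_2 N$.

```python
import math
import random

def longest_run_of_ones(bits):
    """Length of the longest block of consecutive ones."""
    best = current = 0
    for b in bits:
        current = current + 1 if b == 1 else 0
        best = max(best, current)
    return best

rng = random.Random(0)
average_run = {}
for n in (2**10, 2**13, 2**16):
    runs = [longest_run_of_ones([rng.getrandbits(1) for _ in range(n)]) for _ in range(10)]
    average_run[n] = sum(runs) / len(runs)
    print(n, math.log2(n), average_run[n])
```

The averages track $\log_2 N$ up to an additive constant, so an anomalous run of ones that is short relative to $\log_2 N$ cannot be told apart from what the null already produces.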

Note that, when calibrating a test by permutation there are essentially two sources of randomness:
the randomness intrinsic to the data $X$, and the randomness induced by the permutation. In
particular, if we regard $\pi$ as a uniform random variable over the set of possible permutations $V!$, the
P-value of the test can be rewritten as $\mathfrak{P}(X) = \mathbb{P}\big(\mathrm{SCAN}(X_\pi) \ge \mathrm{SCAN}(X) \mid X\big)$. Under the null hypothesis
the argument is classic: for any given permutation $\pi$, the distribution of $X$ is identical to the
distribution of $X_\pi$, therefore $\mathrm{SCAN}(X)$ is conditionally uniformly distributed in $\{\mathrm{SCAN}(X_\pi) : \pi \in V!\}$
(with multiplicities). The bulk of the effort in the proof is to characterize the behavior of the test
under the alternative. The first step is to, conditionally on the data $X$, "remove" the randomness
in $\pi$. Realizing that for any $S$, $Y_S(X_\pi)$ is simply a (normalized) sum of elements sampled without
replacement from $X$, we are able to use a concentration inequality for sampling without replacement to
upper-bound the P-value by an expression involving $\mathrm{SCAN}(X)$, the sample mean and variance of
$X$, and $\max_v X_v$. The remainder of the proof consists in controlling those terms for the exponential
model.

For technical reasons, we place an upper bound $\bar\theta$ on the nonzero $\theta_v$'s to streamline the proof
arguments and be able to control $\max_v X_v$. However, note that this condition is not a simple
artifact of the proof technique and its removal will invalidate the statement. A way around this
assumption is to state the result in terms of $\min_{v \in S} \theta_v$ instead of $\frac{1}{|S|}\sum_{v \in S} \theta_v$ and use censoring prior
to scanning (see the discussion in Section 6).


4 Scanning the ranks 

Having observed $x = (x_v, v \in V)$, scanning the ranks amounts to replacing every observation with
its rank among all the observations, and computing the scan (9). We call this the rank scan. As 
for all rank-based methods, the null distribution is the permutation distribution when there are no 
ties. 


• When there are no ties with probability one, the null distribution of the test statistic
is determined by the data size $N$ alone, and therefore the test can be calibrated by Monte Carlo
simulation before data is observed.

• When there are ties the rank scan test can be also calibrated by permutation. If one breaks 
ties using the average rank then calibration must be done anew for any given dataset. A 
much better alternative is to break ties randomly so that we are back in the first case, and 
can calibrate the test before seeing the data. The latter option is computationally superior 
and is the one we analyze. 

In summary, the rank scan test is computationally more advantageous, when compared with 
the test of the previous section, calibrated by permutation. An additional advantage of the rank 
scan is its robustness to outliers — although the permutation scan after censoring (discussed in 
Section 6) is also robust to outliers. See Section 5 for implementation issues and a computational 
complexity analysis. 

Formally, let $x = (x_v, v \in V)$ denote the observations as before, and for every $v \in V$, let $r_v$ be the
rank (in increasing order) of $x_v$ in $x$, where ties are broken randomly, and let $r = (r_v, v \in V)$ be the
vector of ranks. The rank scan test returns the P-value $\mathfrak{P}(r)$ defined in (10).
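
The one-time calibration described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the interval lengths, the number of simulations, the helper names and the simulated data are hypothetical, and the statistic is assumed to be the centered, normalized interval sum applied to the ranks. The critical value is simulated from random permutations of $1,\dots,N$ and so depends on $N$ only; the observed data enter only through their ranks, with ties broken at random.

```python
import math
import random

def random_tiebreak_ranks(x, rng):
    """Ranks 1..N in increasing order of x, ties broken uniformly at random."""
    order = sorted(range(len(x)), key=lambda v: (x[v], rng.random()))
    ranks = [0] * len(x)
    for position, v in enumerate(order):
        ranks[v] = position + 1
    return ranks

def scan_statistic(r, lengths):
    """Max over intervals of the given lengths of the centered, normalized sum of r."""
    n = len(r)
    mean = sum(r) / n
    prefix = [0.0]
    for value in r:
        prefix.append(prefix[-1] + value - mean)
    return max((prefix[s + k] - prefix[s]) / math.sqrt(k)
               for k in lengths for s in range(n - k + 1))

def calibrate_rank_scan(n, lengths, alpha=0.05, num_sim=200, rng=None):
    """Approximate (1 - alpha) null quantile of the rank scan; needs only the data size n."""
    rng = rng or random.Random(0)
    ranks = list(range(1, n + 1))
    null_values = []
    for _ in range(num_sim):
        rng.shuffle(ranks)
        null_values.append(scan_statistic(ranks, lengths))
    null_values.sort()
    return null_values[min(num_sim - 1, math.ceil((1 - alpha) * num_sim))]

rng = random.Random(2)
n, lengths = 256, [8, 16, 32]
critical = calibrate_rank_scan(n, lengths, rng=rng)  # computed before seeing any data

x = [rng.gauss(0.0, 1.0) for _ in range(n)]
for v in range(100, 116):  # plant an anomalous interval of length 16
    x[v] += 3.0
stat = scan_statistic(random_tiebreak_ranks(x, rng), lengths)
print(stat, critical, stat > critical)
```

Once `critical` has been stored for a given $N$, testing a new dataset of that size costs a single scan.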

Because the rank scan test is naturally regarded as a kind of permutation scan test, we assume 
similarly upper and lower bounds on the size of the anomalous set as in Section 3. The first result 
we present is rather general, and it is not particular to the exponential family and applies to the 
general setting in Section 2.1. For rank-based procedures the performance will depend naturally 
on the ability to rank correctly an anomalous observation against a normal one. This is naturally 
captured by the following quantity: 

$$p_v = \mathbb{P}(Y > X) + \tfrac12\,\mathbb{P}(Y = X), \quad \text{where } X \sim F_0 \text{ and } Y \sim F_v \text{ are independent}. \tag{11}$$

The larger $p_v$ is the higher is the probability of ranking the two observations correctly.
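
The quantity in (11) can be estimated from two samples by counting correctly ordered pairs, with ties counted as one half. The sketch below is illustrative (sample sizes and names are arbitrary) and compares the estimate with the closed form $\Phi(\theta/\sqrt{2})$ that holds in the normal location model.

```python
import random
from statistics import NormalDist

def estimate_p(null_sample, alt_sample):
    """Empirical P(Y > X) + 0.5 * P(Y = X) over all pairs (X, Y)."""
    wins = sum((y > x) + 0.5 * (y == x) for x in null_sample for y in alt_sample)
    return wins / (len(null_sample) * len(alt_sample))

rng = random.Random(0)
theta = 1.0
null_sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
alt_sample = [rng.gauss(theta, 1.0) for _ in range(1000)]
estimate = estimate_p(null_sample, alt_sample)
closed_form = NormalDist().cdf(theta / 2 ** 0.5)
print(estimate, closed_form)

# Bernoulli(1/2) null against Bernoulli(1) anomaly: the value is exactly 3/4.
print(estimate_p([0, 1], [1, 1]))
```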


Theorem 2. Refer to the hypothesis testing problem in Section 2.1 and consider the test that rejects
the null if $\mathfrak{P}(R) \le \alpha$, where $\mathfrak{P}$ is defined in (10), with $b = b_N \to \infty$ and $b_N/\log N \to 0$. This test
has level at most $\alpha$. Furthermore this test has power converging to 1 as $N \to \infty$ provided

$$\frac{1}{|S|}\sum_{v \in S} p_v \ge \frac12 + \tau\sqrt{2\log(N)/|S|}, \quad \text{with } \tau > \frac{1}{2\sqrt{3}} \text{ fixed},$$

and $S$ belongs to $\mathcal{S}$ defined in (8) with $q_l - \log_2\log N \to +\infty$ and $q_u - \log_2 N \to -\infty$ as $N \to \infty$.

This result characterizes the performance of the rank scan test for general distributions (actually
we do not even need to assume $F_v$ stochastically dominates $F_0$). To get a better sense of this result
and be able to compare it with the previous theorem it is useful to consider the particular case of
the exponential model. Define

$$\Upsilon_0 = \tfrac12\,\mathbb{E}[\max(X,Y)], \tag{12}$$

where $X, Y \sim F_0$ are independent.

Proposition 2. Refer to the hypothesis testing problem in (3), assume $F_0$ has zero mean and
variance one, and refer to the test in Theorem 2. The test has level at most $\alpha$. Moreover, it has
power converging to 1 as $N \to \infty$ when

$$\frac{1}{|S|}\sum_{v \in S} \theta_v \ge \tau\sqrt{2\log(N)/|S|}, \quad \text{with } \tau > \frac{1}{2\sqrt{3}\,\Upsilon_0} \text{ fixed}.$$

The headline here is that the rank scan requires a signal amplitude which is a factor $1/(2\sqrt{3}\,\Upsilon_0)$ larger
than what is required of the regular scan test calibrated by Monte Carlo with full knowledge of
the null distribution. This is (qualitatively) in line with similar results in more classical settings
(Hettmansperger, 1984). For the normal location model, we find that $1/(2\sqrt{3}\,\Upsilon_0) = \sqrt{\pi/3} \approx 1.023$,
so the detection threshold of the rank scan is almost the same as that of the regular scan test; see
Appendix 7.5.2 for details. Note that $\Upsilon_0 \le 1/(2\sqrt{3})$ (otherwise this would contradict the known
minimax lower bounds) and that equality is attained if and only if $F_0$ is the uniform distribution.*
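
The constant quoted for the normal model can be checked numerically. The sketch below assumes that (12) reads $\Upsilon_0 = \tfrac12\mathbb{E}[\max(X,Y)]$, which is the reading consistent with the value $\sqrt{\pi/3}$ above, and uses the standard fact that the expected maximum of two independent standard normals is $1/\sqrt{\pi}$. The Monte Carlo sample size and the variable names are arbitrary.

```python
import math
import random

rng = random.Random(0)
num_pairs = 200_000

# Standard normal null: closed form Upsilon_0 = 1 / (2 * sqrt(pi)).
upsilon_normal = 0.5 * sum(
    max(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(num_pairs)
) / num_pairs
print(upsilon_normal, 1 / (2 * math.sqrt(math.pi)))
print(1 / (2 * math.sqrt(3) * upsilon_normal), math.sqrt(math.pi / 3))

# Uniform null with unit variance, i.e. uniform on [-sqrt(3), sqrt(3)]:
# here Upsilon_0 equals the extremal value 1 / (2 * sqrt(3)).
half_width = math.sqrt(3)
upsilon_uniform = 0.5 * sum(
    max(rng.uniform(-half_width, half_width), rng.uniform(-half_width, half_width))
    for _ in range(num_pairs)
) / num_pairs
print(upsilon_uniform, 1 / (2 * math.sqrt(3)))
```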

Remark 3. As in the case of Theorem 2 the assumption on the moments of $F_0$ is used only for
clarity of presentation. In general, the permutation-based test is asymptotically powerful under the
alternative if there is a set $S \in \mathcal{S}$ such that

^ X; ^ r^/2\og{N)l\S\ , with r > ^ -— fixed, 

PI 2\/6{Iq-P-q! 1) 

where $\mu_0$ denotes the mean of $F_0$.

The proof of Theorem 2 starts essentially as that of Theorem 1. Under the alternative the
P-value is upper bounded by an expression involving $\mathrm{SCAN}(R)$. Control of this term is more
complicated than that of $\mathrm{SCAN}(X)$ in the previous theorem, since the elements of $R$ are not independent,
but can be done by controlling the first two moments of $R$. For Proposition 2 we note that for the
exponential model one can relate $p_v = p_{\theta_v}$ to $\theta_v$ by a Taylor expansion around zero, concluding the
proof.

Small and very small intervals 

The conditions of Theorem 2 allow for dealing with intervals of length of order (strictly) larger than
$\log N$. We give here results that encompass the scenario where the interval might be of smaller
length. To keep the discussion simple we consider the class of intervals of a fixed size $|S| = k$ under
the alternative, and explain later how this result is generalized for a class of intervals of different
sizes. In this situation there is no need to consider an approximating net and we simply scan over
the entire class, denoted by $\mathcal{S}$. Recall the definition of the permutation P-value (10).

*This is based on a personal communication from Richard J. Samworth and Tengyao Wang, who got interested
in this question after one of the present authors presented this work at Cambridge University.

Proposition 3. Refer to the hypothesis testing problem in Section 2.1 and consider the test that
rejects the null if $\mathfrak{P}(R) \le \alpha$. Then the test has level at most $\alpha$ and power converging to 1 as $N \to \infty$
provided there is an interval $S$ of length $k$ such that
(i) $\frac{1}{|S|}\sum_{v \in S} p_v = 1 - o(N^{-2/k})$ when $2 \le k = o(\log N)$; or
(a) T,viSPv > 1 ~ 5 6xp(-^) when k = clog for some c> 0 fixed. 

Theorem 2 and Proposition 3 together cover essentially all interval sizes which are $o(N)$.
Theorem 2 covers the case of larger intervals, in which case $\frac{1}{|S|}\sum_{v \in S} p_v$ can go to 1/2 provided it does
not converge too fast, and the test is still powerful asymptotically. In Proposition 3, a sufficient
condition for an asymptotically powerful test is that $\frac{1}{|S|}\sum_{v \in S} p_v$ converges to 1 at a certain rate when
the size of the anomalous interval is $o(\log N)$. If the interval size is $c\log N$ with $c > 0$ arbitrary
the rank test is asymptotically powerful when $\frac{1}{|S|}\sum_{v \in S} p_v$ is greater than a constant (strictly larger
than 1/2) depending on $c$.

Extending this result to the exponential model is not possible without additional knowledge of
the family of distributions, as having $p_v$ bounded away from 1/2 implies $\theta_v$ is bounded away from
0. As an example, consider the normal means model when $k = o(\log N)$. In this case, we have

$$p_\theta = \Phi(\theta/\sqrt{2}) \ge 1 - \tfrac12 e^{-\theta^2/4}.$$

Hence, the condition in the proposition is met whenever $e^{-\theta^2/4} = o(N^{-2/k})$. This is satisfied when

$$\theta = \tau\sqrt{2\log(N)/k}, \quad \text{with } \tau > 2 \text{ fixed}. \tag{13}$$

This means that in this case the rank scan requires an amplitude at most two times larger than
the regular scan test calibrated with full knowledge of the null distribution.
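
For completeness, here is the short calculation behind the requirement $\tau > 2$. It assumes that condition (i) of Proposition 3 asks for $1 - p_\theta = o(N^{-2/k})$ and uses the Gaussian tail bound quoted above.

```latex
\theta = \tau\sqrt{2\log(N)/k}
\;\Longrightarrow\;
\frac{\theta^2}{4} = \frac{\tau^2 \log N}{2k}
\;\Longrightarrow\;
1 - p_\theta \le \tfrac12 e^{-\theta^2/4} = \tfrac12 N^{-\tau^2/(2k)},
\qquad
\frac{N^{-\tau^2/(2k)}}{N^{-2/k}}
= \exp\!\Big(-\frac{(\tau^2 - 4)\log N}{2k}\Big) \to 0 .
```

The last limit holds because $\tau > 2$ is fixed and $k = o(\log N)$, so that $(\log N)/k \to \infty$.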

Finally, note that the conditions of Proposition 3 might not be possible to meet for certain
distributions of the exponential family. For instance, in Example 1,
$\frac{1}{|S|}\sum_{v \in S} p_v = 3/4$, a case not covered by Proposition 3 when the interval size is smaller than $c\log N$
and $c$ is small enough. But this is expected since no test has any power if $c$ is sufficiently small.

Remark 4. Proposition 3 considered the case when the size of the anomalous interval is known.
However, we could consider the class of intervals of length at least 2 and at most $k$ for some
given $k = O(\log N)$. In this case we would simply scan the ranks for every fixed interval size up to
$k$ and apply a Bonferroni correction to the P-values. Following through the steps of the proof, one
can see that the rank scan test would be asymptotically powerful when
(i') $\frac{1}{|S|}\sum_{v \in S} p_v = 1 - o\big((N\log N)^{-2/|S|}\big)$ when $2 \le |S| = o(\log N)$; or
(h’) T,viSPv > 1 ~ I 6xp(-^) when |5| = clog A" for some c> 0 fixed. 

For the normal location model and considering $k = o(\log N)$, we can see that this is satisfied when
(13) holds.

5 Numerical experiments 

5.1 Computational complexity 

We already cited some works where fast (typically approximate) algorithms for scanning various
classes of subsets are proposed (Arias-Castro et al., 2005; Neill, 2012; Neill and Moore, 2004;
Walther, 2010). For example, as we saw in Lemma 1, Arias-Castro et al. (2005) design an approximating
net $\mathcal{S}_b$ for the class of all intervals $\mathcal{S}$ that can be scanned in $O(N b\, 4^b)$ time. Furthermore, we saw
in Proposition 1 that this procedure achieves the optimal asymptotic power as long as $b = b_N \to \infty$.
For example, if $b_N \asymp \log\log N$, then the computational complexity is of order $O(N\,\mathrm{polylog}(N))$.
In any case, suppose that a scanning algorithm has been chosen and let $C_N$ denote its computational
complexity. The oracle scan test and the rank scan test are then comparable, in that they
estimate the null distribution of their respective test statistic by simulation, and this is done only
once for each data size $N$. With this preprocessing already done, the computational complexity of
these two procedures is $C_N$, the cost of a single scan when applied to data of size $N$. In contrast, the
permutation scan test is much more demanding, in that it requires scanning each of the permuted
datasets, and this is done every time the test is applied. Assuming $B$ permutations are sampled at
random for calibration purposes, the computational complexity is $B\,C_N$, that is, $B$ times that of the
oracle or rank variants (not accounting for preprocessing). $B$ is typically chosen in the hundreds
($B = 200$ in our experiments), if not thousands, so the computational burden can be much higher
for the permutation test.

5.2 Simulations 

We present the results of some basic numerical experiments that we performed to corroborate our 
theoretical findings in finite samples. We generated the data from the normal location model,
where $F_\theta = \mathcal{N}(\theta, 1)$, which is arguably the most emblematic one-parameter exponential family
and a popular model in signal and image processing. We used the regular scan test, calibrated with 
full knowledge of the null distribution, as a benchmark. The permutation scan test and rank scan 
test were calibrated by permutation. 

The test statistic that we use in our experiments is the scan over all intervals of dyadic length.
This subclass of intervals is morally similar to $\mathcal{S}_0$ (corresponding to $b = 0$) but somewhat richer.
This choice allows us to both streamline the implementation and make the computations very fast
via one application of the Fast Fourier Transform per dyadic length. In detail, letting $\mathcal{S}$ denote the
class of all discrete intervals in $V$, this amounts to taking as approximating set

$$\mathcal{S}_{\mathrm{dyad}} = \big\{S \in \mathcal{S} : |S| = 2^j \text{ for some } j \in \mathbb{N}\big\}.$$

As explained earlier, the calibration by permutation and the rank-based approach are valid no 
matter what subclass of intervals is chosen, and in fact, the same mathematical results apply as 
long as the subclass is an appropriate approximating net. We encourage the reader to experiment 
with his/her favorite scanning implementation. 

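It is easy to see that, for each $S \in \mathcal{S}$, there is $S^* \in \mathcal{S}_{\mathrm{dyad}}$ with $S^* \subset S$ and $|S^*| \ge |S|/2$. Hence,

$$\min_{S \in \mathcal{S}}\ \max_{S^* \in \mathcal{S}_{\mathrm{dyad}}} \rho(S,S^*) \ge 1/\sqrt{2}.$$

A priori, this implies that scanning over $\mathcal{S}_{\mathrm{dyad}}$ requires an amplitude $\sqrt{2}$ larger to achieve the same
(asymptotic) performance as scanning over $\mathcal{S}$ or a finer approximating set as considered previously.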
To simplify things, however, in our simulations we took an anomalous interval of dyadic length, so 
that the detection threshold is in fact the same as before. 
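
The $1/\sqrt{2}$ bound can be verified directly. The sketch below assumes the correlation between two intervals is $\rho(S,S^*) = |S \cap S^*| / \sqrt{|S|\,|S^*|}$, a definition taken from earlier in the paper and not restated in this section; the helper names and the range of lengths checked are illustrative.

```python
import math

def correlation(s, s_star):
    """rho(S, S*) = |S intersect S*| / sqrt(|S| |S*|), intervals as inclusive (start, end) pairs."""
    overlap = max(0, min(s[1], s_star[1]) - max(s[0], s_star[0]) + 1)
    return overlap / math.sqrt((s[1] - s[0] + 1) * (s_star[1] - s_star[0] + 1))

def largest_dyadic_subinterval(s):
    """Sub-interval of s with the same left end and the largest power-of-two length."""
    length = s[1] - s[0] + 1
    dyadic = 1 << (length.bit_length() - 1)  # largest power of two <= length
    return (s[0], s[0] + dyadic - 1)

worst = min(
    correlation((1, n), largest_dyadic_subinterval((1, n))) for n in range(1, 5000)
)
print(worst, 1 / math.sqrt(2))
```

The worst case occurs for lengths of the form $2^j - 1$, where the dyadic sub-interval covers just over half of $S$.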

We set N = 2^® and tried two different lengths for the anomalous interval |5| e {2^, 2^^}. All the 
nonzero $\theta_v$ were taken to be equal to

$$\theta_S = t\sqrt{2\log(N)/|S|} \tag{14}$$

with $t$ varying. The critical values and power are based on 1000 repeats in each case. A level of
significance of 0.05 was used. Also, 200 permutations were used for the permutation scan test. The
results are presented in Figure 1. At least in these small numerical experiments, the three tests
behave comparably, with the rank scan slightly dominating the others. Although the last finding is
somewhat surprising, this is a finite-sample effect and is localized in the intermediate power range
(around a power of 0.5) and so does not contradict the theory developed earlier. In fact, the three
tests achieve power 1 at roughly the same signal amplitude, confirming the theory.

[Figure 1: two panels of power curves, one per anomalous interval length; horizontal axis: $t$.]

Figure 1: Power curves (with 95% margin of error) for the three tests (all set at level 0.05) as a
function of the parameter $t$ in (14): the scan test calibrated with knowledge of the null distribution
(black); the permutation scan test (blue); and the rank scan test (red). On the left are the results
for the shorter anomalous interval and on the right for the longer one, with the same $N$ in both cases. Each situation was repeated 1,000
times and each time 200 permutations were drawn for calibration. The vertical black dashed line
is the minimax boundary for $t$. The horizontal black dashed line is the significance level 0.05.


5.3 Comparison with RSI 

Next, we compare our rank scan with the robust segment identifier (RSI) of Cai et al. (2012).
This is a recent method based on taking the median over bins of a certain size (a tuning parameter
of the method) and then scanning over intervals. Because the median is asymptotically normal, it
allows for a calibration that only requires the value of the null density at 0. In turn, one can try
to estimate this parameter. Although the method is not distribution-free proper, it appears to be
the main contender in the literature. We first compare the two methods on simulated data, both in
the context of detection (the problem we considered so far) and in the context of identification (a
problem considered in that paper).


Detection In the problem of detection, we compare the performance of the rank scan test and
RSI with bin size $m \in \{10, 20\}$ on normal data. To turn RSI into a test, we reject if it detects
any anomalous interval. In the simulation, we set the sample size to $N = 50{,}000$ and considered the case
where there is only one signal interval with known length $|S| \in \{100, 1000\}$. The amplitudes satisfy
(14) as before. We report the empirical power curves (based on 100 repeats) in Figure 2.

[Figure 2: two panels of power curves, titled "N = 50000, |S| = 100" (left) and "N = 50000, |S| = 1000" (right); horizontal axis: $t$.]

Figure 2: Power curves (with 95% margin of error) for the three tests as a function of the parameter
$t$ in (14): the rank scan test (red); RSI with bin size 10 (solid green); and RSI with bin size 20
(dashed green). The rank scan test is set at level 0.05 and its critical value is from 1000 repeats.
On the left are the results for $|S| = 100$ and on the right for $|S| = 1000$. $N = 50{,}000$ in both cases.
Each situation was repeated 100 times. The vertical black dashed line is the minimax threshold for
$t$. The horizontal black dashed line is the significance level 0.05.

To be fair, both methods only scan candidate signal intervals of length $|S|$. The rank scan
is calibrated as before. For RSI, we set the threshold to \/21og for the normalized data after 
localization to better control the family-wise type I error as explained in (Cai et al., 2012). From
Figure 2, we can see that RSI is a bit more conservative. In fact, a drawback of RSI is the difficulty
of calibrating it correctly.† In any case, the rank scan test outperforms RSI in these simulations.

Identification In the problem of identification, we compare the rank scan and RSI. Although we
focused on the problem of detection so far, a scan can be as easily used for testing as for estimation
(i.e., identification). Indeed, one sets an identification threshold and extracts all the intervals that
exceed that threshold. Some post-processing (such as merging significant intervals that intersect,
or keeping the most significant among significant intervals that intersect) is often applied.

Here, in an effort to be fair, we simply took the procedure of (Cai et al., 2012), which is
essentially the procedure of (Jeng et al., 2010), but calibrated it as we did for testing. Note
that this implies a very stringent false identification rate (at the 0.05 testing level this means that
the chance that one or more intervals are identified by mistake is 0.05). We then compare its
performance to that of the rank scan testing procedure calibrated in the same fashion.

Following (Cai et al., 2012), in the simulation, we set the sample size to $N$ = 10^. We consider a
range of null distributions: the standard normal distribution, the t-distribution with 15 degrees of
freedom and that with one degree of freedom. In each case, we set the signal mean to $\theta_S \in \{1, 1.5, 2\}$.
There are three signal intervals, $S_1, S_2, S_3$, starting at positions 1000, 2000, 3000, and having lengths
2^, 2®, 2®, respectively. We set the threshold for the rank scan test by simulation at a significance
level of 0.05. For RSI, we tried several bin sizes, $m \in$ {2®, 2®}. To simplify the computation, both

†Of course, it could be calibrated by permutation, but this would make the procedure much more like the
permutation scan test (with the same high computational burden), somewhat far from the intentions of (Cai et al.,
2012).

methods only scan dyadic intervals of length at most 2®. As in (Cai et al., 2012), we compare their
performance in terms of the following dissimilarities

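$$D_j = \min_{\hat{S} \in \hat{\mathcal{S}}} \{1 - \rho(S_j, \hat{S})\},$$

and the number of false positives, namely the cardinality $\#O$ of

$$O = \{\hat{S} \in \hat{\mathcal{S}} : \hat{S} \cap S_j = \emptyset \text{ for all } j\},$$

where $\hat{\mathcal{S}}$ are the estimated signal intervals.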
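
The two performance measures can be computed as in the following sketch. It assumes the interval correlation $\rho(S,\hat{S}) = |S \cap \hat{S}| / \sqrt{|S|\,|\hat{S}|}$ defined earlier in the paper, adopts the convention (an assumption) that the dissimilarity is 1 when nothing is selected, and uses hypothetical interval positions and lengths that are not those of the experiments.

```python
import math

def correlation(a, b):
    """|A intersect B| / sqrt(|A| |B|) for intervals given as inclusive (start, end) pairs."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    return overlap / math.sqrt((a[1] - a[0] + 1) * (b[1] - b[0] + 1))

def dissimilarity(true_interval, estimated):
    """D_j: smallest value of 1 - rho(S_j, S_hat) over the estimated intervals."""
    return min((1.0 - correlation(true_interval, e) for e in estimated), default=1.0)

def false_positives(true_intervals, estimated):
    """Estimated intervals that intersect none of the true signal intervals."""
    return [e for e in estimated
            if all(correlation(t, e) == 0.0 for t in true_intervals)]

# Hypothetical example: three true intervals and three selected intervals.
true_intervals = [(1000, 1015), (2000, 2031), (3000, 3063)]
estimated = [(1000, 1015), (2004, 2035), (5000, 5007)]

print([dissimilarity(t, estimated) for t in true_intervals])
print(false_positives(true_intervals, estimated))
```

In this example the first interval is recovered exactly, the second is recovered with a shift of four positions, the third is missed, and one selected interval is a false positive.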

We report the average and standard deviation (in parentheses in the tables below) based on
200 repeats in Tables 1, 2, and 3. We can see that the rank scan method performs better than RSI
when the null distribution is normal and t(15), and it performs similarly to RSI with bin size
m = 2^ in t(l). However, when the bin size of RSI is not properly chosen, RSI can perform poorly. 


Table 1: Dissimilarity and number of over-selected intervals in $\mathcal{N}(0,1)$. Here $m_1$ and $m_2$ denote the two RSI bin sizes considered in the text, in the order listed there.

| $\theta_S$ | Method | $D_1$ | $D_2$ | $D_3$ | #O |
|---|---|---|---|---|---|
| 1 | Rank Scan | 0.734 (0.421) | 0.148 (0.284) | 0.031 (0.049) | 0.000 (0.000) |
| 1 | RSI ($m = m_1$) | 0.916 (0.235) | 0.420 (0.406) | 0.095 (0.091) | 0.065 (0.267) |
| 1 | RSI ($m = m_2$) | 0.998 (0.029) | 0.959 (0.144) | 0.326 (0.278) | 0.130 (0.337) |
| 1.5 | Rank Scan | 0.167 (0.326) | 0.019 (0.044) | 0.008 (0.012) | 0.000 (0.000) |
| 1.5 | RSI ($m = m_1$) | 0.593 (0.391) | 0.132 (0.033) | 0.069 (0.029) | 0.080 (0.272) |
| 1.5 | RSI ($m = m_2$) | 0.980 (0.087) | 0.729 (0.284) | 0.204 (0.044) | 0.025 (0.157) |
| 2 | Rank Scan | 0.018 (0.051) | 0.006 (0.024) | 0.004 (0.008) | 0.000 (0.000) |
| 2 | RSI ($m = m_1$) | 0.277 (0.226) | 0.128 (0.021) | 0.064 (0.013) | 0.065 (0.247) |
| 2 | RSI ($m = m_2$) | 0.960 (0.122) | 0.476 (0.162) | 0.193 (0.032) | 0.010 (0.100) |


5.4 Application to the real data 

In this section, we apply the methods to the problem of detecting copy number variants (CNVs)
in the context of next-generation sequencing data. We compare the rank scan method and RSI
on the task of identifying short reads on chromosome 19 of a HapMap Yoruban female sample
(NA19240) from the 1000 Genomes Project (http://www.1000genomes.org), which is the same
data set used in (Cai et al., 2012). Following standard protocols (Ernst et al., 2011), we extend
all the reads to 100 base pairs (BPs). We take 10® reads from the whole data set for comparison 
purposes resulting in 1,281,502 genomic locations. 

We tune RSI as done in (Cai et ah, 2012), setting the bin size to m = 400 and the maximum 
BPs in a possible CNV to L = 2^®. Note that (Cai et ah, 2012) took L = 60,000, which is a bit 
smaller than 2^®. (We chose the latter because we only scan intervals of dyadic length.) To save 
computational time, in the implementation of the rank scan we group read depths in every 200 
positions and take the summation of the read depths for each bin and use that as input (meaning, 
we rank the sums and scan the ranks). We get the critical value for the rank scan method under 
the significance level 0.05 from 1000 repeats. In the experiment, we let RSI and the rank scan 
method only scan dyadic intervals of lengths from 2^ to 2^®. 






Table 2: Dissimilarity and number of over-selected intervals in t(15). Here $m_1$ and $m_2$ denote the two RSI bin sizes considered in the text, in the order listed there.

| $\theta_S$ | Method | $D_1$ | $D_2$ | $D_3$ | #O |
|---|---|---|---|---|---|
| 1 | Rank Scan | 0.806 (0.369) | 0.223 (0.354) | 0.029 (0.048) | 0.000 (0.000) |
| 1 | RSI ($m = m_1$) | 0.926 (0.223) | 0.436 (0.406) | 0.106 (0.099) | 0.050 (0.218) |
| 1 | RSI ($m = m_2$) | 0.996 (0.041) | 0.944 (0.168) | 0.336 (0.278) | 0.125 (0.332) |
| 1.5 | Rank Scan | 0.232 (0.378) | 0.026 (0.079) | 0.010 (0.017) | 0.000 (0.000) |
| 1.5 | RSI ($m = m_1$) | 0.554 (0.391) | 0.143 (0.112) | 0.069 (0.031) | 0.075 (0.282) |
| 1.5 | RSI ($m = m_2$) | 0.992 (0.057) | 0.732 (0.286) | 0.199 (0.042) | 0.020 (0.140) |
| 2 | Rank Scan | 0.034 (0.097) | 0.009 (0.019) | 0.005 (0.014) | 0.000 (0.000) |
| 2 | RSI ($m = m_1$) | 0.277 (0.220) | 0.128 (0.022) | 0.063 (0.013) | 0.060 (0.238) |
| 2 | RSI ($m = m_2$) | 0.968 (0.107) | 0.521 (0.214) | 0.192 (0.030) | 0.010 (0.100) |


Table 3: Dissimilarity and number of over-selected intervals in t(1). Here $m_1$ and $m_2$ denote the two RSI bin sizes considered in the text, in the order listed there.

| $\theta_S$ | Method | $D_1$ | $D_2$ | $D_3$ | #O |
|---|---|---|---|---|---|
| 1 | Rank Scan | 0.989 (0.082) | 0.878 (0.305) | 0.461 (0.448) | 0.000 (0.000) |
| 1 | RSI ($m = m_1$) | 0.950 (0.186) | 0.764 (0.370) | 0.332 (0.358) | 4.305 (5.653) |
| 1 | RSI ($m = m_2$) | 0.998 (0.022) | 0.982 (0.098) | 0.609 (0.392) | 0.520 (0.501) |
| 1.5 | Rank Scan | 0.922 (0.251) | 0.542 (0.455) | 0.067 (0.132) | 0.000 (0.000) |
| 1.5 | RSI ($m = m_1$) | 0.843 (0.307) | 0.342 (0.354) | 0.104 (0.080) | 3.920 (2.082) |
| 1.5 | RSI ($m = m_2$) | 0.983 (0.079) | 0.877 (0.236) | 0.225 (0.111) | 0.055 (0.229) |
| 2 | Rank Scan | 0.763 (0.410) | 0.206 (0.333) | 0.043 (0.093) | 0.000 (0.000) |
| 2 | RSI ($m = m_1$) | 0.619 (0.382) | 0.154 (0.121) | 0.089 (0.063) | 3.945 (2.385) |
| 2 | RSI ($m = m_2$) | 0.978 (0.090) | 0.667 (0.280) | 0.208 (0.05) | 0.060 (0.238) |


After merging the contiguous selected segments, RSI found 30 possible CNVs and the rank scan
method selected 34. Figure 3 shows the histograms of the read depths of the selected CNVs. We
can see the read depth in the rank scan method is generally larger than that in RSI.

[Figure 3: two histogram panels, "RSI" and "Rank Scan"; horizontal axis: size of the CNVs identified.]

Figure 3: Histogram of the read depths of the selected CNVs in log scale (base 10). Both methods
only scan dyadic intervals within the range of lengths stated in Section 5.4. The RSI used a bin size m = 400, while the
rank scan was calibrated as for testing.


6 Discussion

In this paper we consider a prototypical structured detection setting with the particularity that
the null distribution is unknown. When the null distribution is known, various works have shown
that a form of scan test achieves the best possible asymptotic power. When the null distribution
is unknown, one can alternatively calibrate the scan test by permutation. This has been suggested
a number of times in the detection literature. Theorem 1 implies doing this results in no loss
of asymptotic power compared to a calibration by Monte Carlo with full knowledge of the null
distribution. To circumvent the expense of calibrating by permutation, we propose to scan the
ranks. Theorem 2 and Proposition 2 imply that this results in very little loss in asymptotic power.
In our empirical experiments all three methods perform comparably. Generalizations to multivariate
scenarios are also possible (e.g., $X_v \in \mathbb{R}^d$ with $d > 1$). The exact procedure will depend heavily on
the specific problem context. For instance, in imaging contexts the entries of $X_v$ correspond to
measurements in different wavelengths that might be suitably combined in a single univariate score.

Censoring before permutation. When $F_0$ is not of compact support, we can enforce it by applying a
censoring of the form $\tilde X_v = X_v \mathbb{1}\{|X_v| \le t\} + t\,\mathrm{sign}(X_v)\,\mathbb{1}\{|X_v| > t\}$. With a choice of threshold $t = t_N \to \infty$
slowly (e.g., $t_N = \log\log N$), Theorem 1 applies with $\sum_{v \in S} \theta_v$ replaced by $\min_{v \in S} \theta_v$ and without
an upper bound on the $\theta_v$'s. The proof of this result is nearly identical except for very minor
modifications. This censoring has the added advantage of making the method more robust to
possible outliers.
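The censoring step can be sketched in a few lines. This is a minimal illustration under the form given above; the helper names `censor` and `slow_threshold` are hypothetical, and the threshold $t_N = \log\log N$ is just the example mentioned in the text.

```python
import math


def censor(x, t):
    """Apply x * 1{|x| <= t} + t * sign(x) * 1{|x| > t} to each observation."""
    return [v if abs(v) <= t else math.copysign(t, v) for v in x]


def slow_threshold(n):
    """Example threshold t_N = log log N (positive only when N > e)."""
    return math.log(math.log(n))
```

The censored sample would then be passed to the permutation scan in place of the raw observations.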

Other scoring functions. Although rank-sums are intuitive and classically used, any scan based on
$h(r_v)$, where $h$ is increasing, is valid. (Recall that $r_v$ is the rank of $X_v$ in the sample.) In
two-sample testing, it is known that there is no uniformly best choice of function $h$. See (Lehmann and
Romano, 2005, Sec 6.9) where it is shown that choosing $h(r) = \mathbb{E}(Z_{(r)})$, where $Z_{(1)} < \cdots < Z_{(N)}$
are the order statistics of a standard normal sample, is (in some sense) optimal in the normal
location model. Our method of proof applies to a general $h$.
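For concreteness, the normal scores $h(r) = \mathbb{E}(Z_{(r)})$ can be approximated numerically. The sketch below uses Blom's classical approximation $\Phi^{-1}((r - 3/8)/(N + 1/4))$, which is a substitute for the exact expected order statistics and not something prescribed in the text; the function name is hypothetical.

```python
from statistics import NormalDist


def blom_scores(n):
    """Approximate E(Z_(r)), r = 1..n, by Blom's formula (an approximation)."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in range(1, n + 1)]
```

A scan based on these scores would replace each rank $r_v$ by `blom_scores(N)[r_v - 1]` before summing over intervals; the scores are increasing in $r$, as required.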

Unstructured subsets. No permutation approach (including a rank-based approach) has any power 
for detecting unstructured anomalies. A prototypical example is when S is the class of all subsets, 
or all subsets of given size, the latter including the class of singletons. 
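The case of singletons makes the lack of power concrete: the scan over singletons is the centred maximum of the sample, which does not change when the entries are permuted, so the permutation P-value is always 1. The following minimal sketch checks this numerically; the function names are hypothetical and random permutations are sampled instead of enumerating all of them.

```python
import math
import random


def singleton_scan(x):
    """Scan over singletons: max_v x_v minus the sample mean."""
    # math.fsum is used so that the mean does not depend on the summation order.
    return max(x) - math.fsum(x) / len(x)


def permutation_p_value(x, statistic, num_perm=99, seed=0):
    """Monte Carlo permutation P-value; the identity permutation is counted."""
    rng = random.Random(seed)
    observed = statistic(x)
    y = list(x)
    exceed = 1
    for _ in range(num_perm):
        rng.shuffle(y)
        if statistic(y) >= observed:
            exceed += 1
    return exceed / (num_perm + 1)
```

Whatever the data, `permutation_p_value(x, singleton_scan)` returns 1, so no level-$\alpha$ rejection is possible.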

7 Proofs 

7.1 Proof of Theorem 1 

Suppose first we are under the null hypothesis. Note that $X = (X_v, v \in \mathcal{V})$ are IID under the null,
and therefore exchangeable. This means that, for any permutation $\pi$, the marginal distributions
of $\mathrm{SCAN}(X)$ and $\mathrm{SCAN}(X_\pi)$ are the same. This implies that $\mathrm{SCAN}(X)$ is conditionally uniformly
distributed on the set $\{\mathrm{SCAN}(X_\pi), \pi \in \mathcal{V}!\}$ (with multiplicities) and so

$$\mathbb{P}\Big(\big|\{\pi \in \mathcal{V}! : \mathrm{SCAN}(X_\pi) \ge \mathrm{SCAN}(X)\}\big| \le \lfloor \alpha |\mathcal{V}|! \rfloor\Big) \le \frac{\lfloor \alpha |\mathcal{V}|! \rfloor}{|\mathcal{V}|!} \le \alpha ,$$

where $\lfloor z \rfloor$ denotes the integer part of $z$. If there were no ties, the first inequality above would be an
equality, but with ties present the test becomes more conservative. For more details on permutation
tests the reader is referred to (Lehmann and Romano, 2005).
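The permutation calibration just described can be sketched as follows. This is a minimal illustration and makes two assumptions not taken from the text: the scan runs over all intervals of a few given lengths, and the P-value is approximated by Monte Carlo over randomly sampled permutations (with the identity counted) instead of all $|\mathcal{V}|!$ of them. All names are hypothetical.

```python
import math
import random


def scan_statistic(x, lengths):
    """Max over intervals S of (sum of x over S - |S| * mean(x)) / sqrt(|S|)."""
    n = len(x)
    mean = sum(x) / n
    prefix = [0.0]
    for value in x:
        prefix.append(prefix[-1] + value)
    best = -math.inf
    for k in lengths:
        for start in range(n - k + 1):
            interval_sum = prefix[start + k] - prefix[start]
            best = max(best, (interval_sum - k * mean) / math.sqrt(k))
    return best


def permutation_p_value(x, lengths, num_perm=199, seed=0):
    """Fraction of permutations whose scan is at least the observed scan."""
    rng = random.Random(seed)
    observed = scan_statistic(x, lengths)
    y = list(x)
    exceed = 1  # the identity permutation always counts
    for _ in range(num_perm):
        rng.shuffle(y)
        if scan_statistic(y, lengths) >= observed:
            exceed += 1
    return exceed / (num_perm + 1)
```

Rejecting when the returned value is at most $\alpha$ gives a test of level $\alpha$ under exchangeability; ties only make it more conservative, in line with the remark above.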

All that remains to be done is to study the permutation test under the alternative hypothesis.
This requires two main steps. First we need to control the randomness in the permutation,
conditionally on the observations $x$. Once this is done we remove the conditioning.

The key to the first step is the following Bernstein inequality for sums of variables sampled
without replacement from a finite population.

Lemma 2 (Bernstein's inequality for sampling without replacement). Let $(Z_1, \dots, Z_m)$ be obtained
by sampling without replacement from a given set of real numbers $\{z_1, \dots, z_J\} \subset \mathbb{R}$. Define
$z_{\max} = \max_j z_j$, $\bar z = \frac{1}{J}\sum_j z_j$, and $\sigma_z^2 = \frac{1}{J}\sum_j (z_j - \bar z)^2$. Then the sample mean $\bar Z = \frac{1}{m}\sum_i Z_i$ satisfies

$$\mathbb{P}\big(\bar Z \ge \bar z + t\big) \le \exp\left[-\frac{m t^2}{2\sigma_z^2 + \frac{2}{3}(z_{\max} - \bar z)\,t}\right], \quad \forall t \ge 0 .$$


This result is a consequence of (Hoeffding, 1963, Th. 4) and Chernoff's bound, from which
Bernstein's inequality is derived, as in† (Shorack and Wellner, 1986, p 851). See (Bardenet and
Maillard, 2013; Boucheron et al., 2013; Dembo and Zeitouni, 2010) for a discussion of the literature
on concentration inequalities for sums of random variables sampled without replacement from a
finite set.
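Lemma 2 can be checked numerically on a small population, where the exact tail probability is available by enumeration. The sketch below is only a sanity check under the statement of the lemma; the function names are hypothetical.

```python
import itertools
import math


def bernstein_bound(z, m, t):
    """Right-hand side of Lemma 2 for population z, sample size m, deviation t > 0."""
    size = len(z)
    z_bar = sum(z) / size
    variance = sum((v - z_bar) ** 2 for v in z) / size
    denominator = 2 * variance + (2 / 3) * (max(z) - z_bar) * t
    return math.exp(-m * t * t / denominator)


def exact_tail(z, m, t):
    """P(sample mean >= z_bar + t) when m values are drawn without replacement."""
    z_bar = sum(z) / len(z)
    hits = 0
    total = 0
    # The sample mean only depends on the unordered subset that is drawn.
    for subset in itertools.combinations(z, m):
        total += 1
        if sum(subset) / m >= z_bar + t:
            hits += 1
    return hits / total
```

For the population $\{1, \dots, 10\}$ with $m = 4$, for example, `exact_tail` stays below `bernstein_bound` for every positive deviation, as the lemma guarantees.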

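Applying this result for a fixed (but arbitrary) set $S \in \mathcal{S}_b$, when $\pi$ is uniformly drawn from $\mathcal{V}!$
and $x$ is given, we get

$$\mathbb{P}\big(Y_S(x_\pi) - \sqrt{|S|}\,\bar x > t\big) \le \exp\left[-\frac{t^2}{2\sigma_x^2 + \frac{2}{3}(x_{\max} - \bar x)\,t/\sqrt{|S|}}\right], \quad \forall t \ge 0 ,$$

using the same notation as in Lemma 2. Plugging in $t = \mathrm{SCAN}(x)$, noting that $|S| \ge 2^{q_l}/2$
eventually (because $b \to \infty$), and using this together with a union bound, we get

$$\mathfrak{P}(x) \le |\mathcal{S}_b| \exp\left[-\frac{\mathrm{SCAN}(x)^2}{2\sigma_x^2 + (x_{\max} - \bar x)\,2^{-q_l/2}\,\mathrm{SCAN}(x)}\right] . \qquad (15)$$

(The $\frac{2}{3}$ in the denominator, when multiplied by $\sqrt{2}$, from $|S| \ge 2^{q_l}/2$, is still less than 1.)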

Now we proceed by upper bounding the right-hand side of the above inequality by assuming
we are under the alternative, which yields an upper bound for the P-value $\mathfrak{P}(X)$. This amounts to
controlling the terms $X_{\max} - \bar X$, $\sigma_X^2$, and $\mathrm{SCAN}(X)$ under the alternative (upper-case $X$ relates to
the random quantities).

Recall that $F_0$ has zero mean and unit variance and note that $\mathbb{E}_\theta(X)$ and $\mathrm{Var}_\theta(X)$ are continuous
in $\theta$ (and thus bounded on the interval $[0, \bar\theta]$).

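†There is a typo in the statement of the result in (Shorack and Wellner, 1986, p 851), but following the proof one
can find the correct result. Where the statement of the result reads $\frac{\lambda}{2\sigma^2}$ we should have instead $\frac{\lambda^2}{2\sigma^2}$.

We begin by controlling $X_{\max} - \bar X$. Let $S$ denote the anomalous interval under the alternative.
We have

$$\bar X = \frac{1}{N}\sum_{v \in \mathcal{V}} \mathbb{E}(X_v) + \frac{1}{N}\sum_{v \in \mathcal{V}} \big(X_v - \mathbb{E}(X_v)\big) = O(|S|/N) + o_P(1) = o_P(1) ,$$

as $N \to \infty$, since $|S| = o(N)$, $\theta_v \le \bar\theta$ for all $v \in \mathcal{V}$, and using Chebyshev's inequality in the second
equality. Furthermore, let $X_{\max,S} = \max_{v \in S} X_v$ be the maximum over the anomalous set $S$. Let $S^{\mathsf{c}}$
denote the complement of $S$. A union bound together with $X_{\max} = \max(X_{\max,S}, X_{\max,S^{\mathsf{c}}})$ implies

$$\mathbb{P}(X_{\max} > x) \le \mathbb{P}(X_{\max,S} > x) + \mathbb{P}(X_{\max,S^{\mathsf{c}}} > x) \le |S|\,\bar F_{\bar\theta}(x) + |S^{\mathsf{c}}|\,\bar F_0(x) ,$$

where $\bar F_\theta(x) = \mathbb{P}_\theta(X > x)$ and we used the fact that $\bar F_\theta(x)$ is monotone increasing in $\theta$ (see
Section 1.2). For $c \in (0, \theta^* - \bar\theta)$, we have

$$\bar F_{\bar\theta}(x) = \int_x^\infty e^{\bar\theta u - \log\varphi_0(\bar\theta)}\,dF_0(u) = \frac{1}{\varphi_0(\bar\theta)} \int_x^\infty e^{-cu}\,e^{(\bar\theta + c)u}\,dF_0(u) \le \frac{\varphi_0(\bar\theta + c)}{\varphi_0(\bar\theta)}\,e^{-cx} ,$$

where $\varphi_0(\bar\theta + c) < \infty$ because $\bar\theta + c < \theta^*$. The same bound holds for $\bar F_0(x)$, by monotonicity in $\theta$.
Using this with the above union bound gives $\mathbb{P}(X_{\max} > (2/c)\log N) \to 0$ as $N \to \infty$. This and the
bound on $\bar X$ imply that

$$\mathbb{P}\big(X_{\max} - \bar X > (3/c)\log N\big) \to 0 .$$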

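We now consider $\sigma_X^2$. Similarly as before, we have

$$\sigma_X^2 = \frac{1}{N}\sum_{v \in \mathcal{V}} (X_v - \bar X)^2 \le \frac{1}{N}\sum_{v \in \mathcal{V}} X_v^2 = \frac{1}{N}\sum_{v \in \mathcal{V}} \mathbb{E}(X_v^2) + \frac{1}{N}\sum_{v \in \mathcal{V}} \big(X_v^2 - \mathbb{E}(X_v^2)\big) .$$

On one hand,

$$\frac{1}{N}\sum_{v \in \mathcal{V}} \mathbb{E}(X_v^2) = \frac{1}{N}\sum_{v \notin S} \mathrm{Var}(X_v) + \frac{1}{N}\sum_{v \in S} \big(\mathrm{Var}(X_v) + \mathbb{E}(X_v)^2\big) = 1 + O\!\left(\frac{|S|}{N}\right) = 1 + o(1) ,$$

using $\mathrm{Var}(X_v) = 1$ for $v \notin S$, $\max_{v \in S} \mathrm{Var}(X_v) < \infty$ and $\max_{v \in S} \mathbb{E}(X_v) < \infty$ (since $\max_{v \in S} \theta_v \le \bar\theta$), as
well as our assumption that $|S| = o(N)$. On the other hand,

$$\frac{1}{N}\sum_{v \in \mathcal{V}} \big(X_v^2 - \mathbb{E}(X_v^2)\big) = O_P(1/\sqrt{N}) ,$$

using the fact that $\max_{v \in \mathcal{V}} \mathbb{E}(X_v^4) < \infty$ (since $\max_{v \in \mathcal{V}} \theta_v \le \bar\theta$) combined with Chebyshev's inequality.
We may therefore conclude that

$$\mathbb{P}\big(\sigma_X^2 \le 1 + \varepsilon/4\big) \to 1 ,$$

with a fixed but arbitrary $\varepsilon > 0$ (we will choose an appropriate value for $\varepsilon$ later on).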

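From Lemma 1 (which does apply to the newly defined $\mathcal{S}_b$) there is a set $S^* \in \mathcal{S}_b$ such that
$S \subseteq S^*$ and $\rho(S, S^*) = 1 - o(1)$, by the fact that $b \to \infty$. We then have

$$\mathrm{SCAN}(X) \ge Y_{S^*}(X) - \sqrt{|S^*|}\,\bar X = \sqrt{|S^*|}\,\frac{N - |S^*|}{N}\,\big(\bar X_{S^*} - \bar X_{\mathcal{V} \setminus S^*}\big) ,$$

where $\bar X_{S^*}$ and $\bar X_{\mathcal{V} \setminus S^*}$ are the averages of the components of $X$ over the sets $S^*$ and $\mathcal{V} \setminus S^*$ respectively.
By Chebyshev's inequality, and since $\mathbb{E}(X_v) = 0$ for $v \notin S$,

$$\bar X_{S^*} = \frac{1}{|S^*|}\sum_{v \in S} \mathbb{E}(X_v) + O_P\big(1/\sqrt{|S^*|}\big) , \qquad \bar X_{\mathcal{V} \setminus S^*} = O_P\big(1/\sqrt{N - |S^*|}\big) .$$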


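Recall that the condition of Theorem 1 lower-bounds the signal on the anomalous set $S$ at the level

$$\tau \sqrt{\frac{2\log N}{|S|}} . \qquad (16)$$

Note that this level converges to zero by the assumption on $q_l$ and the fact that $\tau$ is fixed. Furthermore
$\mathbb{E}_\theta(X)$ is increasing in $\theta$ (as $\frac{\partial}{\partial\theta}\mathbb{E}_\theta(X) = \mathrm{Var}_\theta(X) > 0$) and $\mathbb{E}_\theta(X) = \theta + O(\theta^2)$ when $\theta \to 0$ (this
can be checked by noting $\mathbb{E}_\theta(X) = \int x\,e^{\theta x - \log\varphi_0(\theta)}\,dF_0(x)$ and writing the Taylor expansion of the integrand around
zero). Thus $\frac{1}{|S|}\sum_{v \in S} \mathbb{E}(X_v) \ge (1 + o(1))\,\tau\sqrt{2\log N/|S|}$. Using $\sqrt{|S^*|} = (1 + o(1))\sqrt{|S|}$
and $|S| = o(N)$ we get

$$\mathrm{SCAN}(X) \ge (1 + o(1))\,\tau\sqrt{2\log N} + O_P(1) ,$$

therefore

$$\mathrm{SCAN}(X) \ge \sqrt{2(1 + \varepsilon/2)\log N} ,$$

with probability tending to one as $N \to \infty$, where we take $\varepsilon$ so that $\tau = \sqrt{1 + \varepsilon}$.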

We are ready to make use of the upper bound on the P-value given by (15) and using the
condition on $q_l$ we get

$$\log \mathfrak{P}(X) \le \log|\mathcal{S}_b| - \frac{2(1 + \varepsilon/2)\log N}{2(1 + \varepsilon/4) + (3/c)(\log N)\sqrt{2^{-q_l + 1}(1 + \varepsilon/2)\log N}} \le \log|\mathcal{S}_b| - \frac{(1 + \varepsilon/2)\log N}{1 + \varepsilon/4 + o(1)} ,$$

with probability going to 1. For the size of the approximating net we have

$$\log|\mathcal{S}_b| \le \log\big(N 4^{b+1}\big) = \log N + (b + 1)\log 4 = (1 + o(1))\log N , \qquad (17)$$

by our assumption on $b$. Combining these allows us to conclude that $\log \mathfrak{P}(X) \to -\infty$ (meaning
$\mathfrak{P}(X) \to 0$) with probability tending to one, implying that the test has power tending to 1 as
$N \to \infty$, concluding the proof.


7.2 Proof of Theorem 2 

The arguments used for the general permutation test apply verbatim under the null hypothesis, so
all that remains to be done is to study the performance of the rank scan test under the alternative.
We may directly apply (15), to obtain

$$\mathfrak{P}(r) \le |\mathcal{S}_b| \exp\left[-\frac{\mathrm{SCAN}(r)^2}{\frac{N^2}{6} + \frac{N}{2}\,2^{-q_l/2}\,\mathrm{SCAN}(r)}\right] , \qquad (18)$$

where we used $\sigma_r^2 = (N^2 - 1)/12 < N^2/12$, $r_{\max} = N$ and $\bar r = (N + 1)/2$, so that $r_{\max} - \bar r < N/2$. The
previous bounds can be directly computed when there are no ties in the ranks, and it is easy to
verify that they also hold if ties are dealt with in any of the classical ways (assigning the average
rank, randomly breaking ties, etc). As before, this is a result conditional on the observations $X = x$
and hence the ranks $R = r$. The next step is to remove this conditioning, which now amounts to
controlling the term $\mathrm{SCAN}(R)$.
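The values of $\bar r$ and $\sigma_r^2$ used in (18) are those of the population $\{1, \dots, N\}$ and can be verified exactly. The following short check uses exact rational arithmetic; the function name is hypothetical.

```python
from fractions import Fraction


def rank_moments(n):
    """Exact mean and (population) variance of the ranks 1..n without ties."""
    ranks = range(1, n + 1)
    mean = Fraction(sum(ranks), n)
    variance = sum((r - mean) ** 2 for r in ranks) / n
    return mean, variance
```

For any $N$ the output agrees with $(N + 1)/2$ and $(N^2 - 1)/12$, which is why the denominator in (18) involves only $N$ and $q_l$.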

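Let $S$ denote the anomalous interval under the alternative. From Lemma 1 there is a set $S^* \in \mathcal{S}_b$
such that $S \subseteq S^*$ and $\rho(S, S^*) = 1 - o(1)$, by the fact that $b \to \infty$.
Since

$$\mathrm{SCAN}(R) \ge Y_{S^*}(R) - \sqrt{|S^*|}\,\frac{N + 1}{2} ,$$

we focus on obtaining a lower bound on $Y_{S^*}(R)$ that applies with high probability.
Note that

$$\mathbb{E}\big(Y_{S^*}(R)\big) = \frac{1}{\sqrt{|S^*|}}\sum_{v \in S^*} \mathbb{E}(R_v) ,$$

and

$$\mathrm{Var}\big(Y_{S^*}(R)\big) = \frac{1}{|S^*|}\left(\sum_{v \in S^*} \mathrm{Var}(R_v) + \sum_{v \ne w \in S^*} \mathrm{Cov}(R_v, R_w)\right) .$$

In an analogous fashion to that in (Hettmansperger, 1984), we can make the following claims about
the first two moments of the ranks.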



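Lemma 3. Suppose $Z_i \sim F_i$, $i \in [s]$, are independent, and also independent of $Z_{s+1}, \dots, Z_n$, which are
i.i.d. and distributed as $F_0$. Let $R_i$ denote the rank (in increasing order) of $Z_i$ in the combined
sample, and suppose ties are broken randomly. Define

$$p_{i,j} = \mathbb{P}(X > Y) + \tfrac{1}{2}\,\mathbb{P}(X = Y) ,$$

where $X \sim F_i$, $Y \sim F_j$ are independent. Then

$$\mathbb{E}(R_i) = \begin{cases} (n - s)\,p_{i,0} + \sum_{j \in [s],\, j \ne i} p_{i,j} + 1 , & \text{when } i \in [s], \\[4pt] \dfrac{n + s + 1}{2} - \sum_{j \in [s]} p_{j,0} , & \text{when } i \notin [s]. \end{cases}$$

Furthermore, as $n, s \to \infty$, $s = o(n)$, for $i \in [s]$

$$\mathrm{Var}(R_i) = (\lambda_i - p_{i,0}^2)\,n^2 + O(sn) ,$$

where

$$\lambda_i = \mathbb{P}\big(\{X > Y_1\} \cap \{X > Y_2\}\big) + \mathbb{P}(X = Y_1 > Y_2) + \tfrac{1}{3}\,\mathbb{P}(X = Y_1 = Y_2) ,$$

where $X \sim F_i$ and $Y_1, Y_2 \sim F_0$ are jointly independent. Finally, for any $i, j \in [n]$ with $i \ne j$,

$$\mathrm{Cov}(R_i, R_j) = O(n) .$$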
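The expected-rank formula of Lemma 3 can be evaluated exactly for discrete distributions. The sketch below is only an illustration: distributions are represented as dictionaries mapping values to probabilities (a hypothetical encoding), indices are 0-based, and an index `i >= s` stands for any null observation.

```python
from fractions import Fraction


def p_win(pmf_x, pmf_y):
    """P(X > Y) + P(X = Y) / 2 for independent discrete X and Y."""
    total = Fraction(0)
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            if x > y:
                total += px * py
            elif x == y:
                total += px * py / 2
    return total


def expected_rank(i, anomalous, null, n):
    """E(R_i) as in Lemma 3; `anomalous` lists F_1..F_s and `null` is F_0."""
    s = len(anomalous)
    if i < s:
        others = sum(p_win(anomalous[i], anomalous[j]) for j in range(s) if j != i)
        return (n - s) * p_win(anomalous[i], null) + others + 1
    return Fraction(n + s + 1, 2) - sum(p_win(f, null) for f in anomalous)
```

A quick consistency check is that the expected ranks of all $n$ observations sum to $n(n+1)/2$, since the ranks are a permutation of $1, \dots, n$.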


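For the sake of completeness we sketch a proof of Lemma 3 in Section 7.5.1. Recall the
definition of $p_v$ in (11) and $p_{v,w}$ in Lemma 3. Using the fact that for any $i, j$ we have $p_{i,j} + p_{j,i} = 1$
we get

$$\begin{aligned}
\sqrt{|S^*|}\,\mathbb{E}\big(Y_{S^*}(R)\big) &= \sum_{v \in S^*} \mathbb{E}(R_v) = \sum_{v \in S} \mathbb{E}(R_v) + \sum_{v \in S^* \setminus S} \mathbb{E}(R_v) \\
&= \sum_{v \in S}\Big((N - |S|)\,p_v + \sum_{w \in S,\, w \ne v} p_{v,w} + 1\Big) + \sum_{v \in S^* \setminus S}\Big(\frac{N + |S| + 1}{2} - \sum_{w \in S} p_w\Big) \\
&= |S|(N - |S|)\,\bar p_S + \sum_{v \in S}\sum_{w \in S,\, w \ne v} p_{v,w} + |S| + |S^* \setminus S|\,\tfrac{1}{2}(N + |S| + 1) - |S^* \setminus S|\,|S|\,\bar p_S \\
&= |S|(N - |S| - |S^* \setminus S|)\,\bar p_S + \tfrac{1}{2}|S|(|S| + |S^* \setminus S|) + \tfrac{1}{2}|S| + |S^* \setminus S|\,\bar r \\
&= |S|(N - |S| - |S^* \setminus S|)(\bar p_S - 1/2) + |S|\,\bar r + |S^* \setminus S|\,\bar r \\
&= |S|(N - |S| - |S^* \setminus S|)(\bar p_S - 1/2) + |S^*|\,\bar r ,
\end{aligned}$$

where $\bar r = (N + 1)/2$ and $\bar p_S = \frac{1}{|S|}\sum_{v \in S} p_v$ is the average of $p_v$ over the anomalous set.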

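Note that for any $v \in [N]$ we trivially have $\mathrm{Var}(R_v) \le N^2$, and by Lemma 3, $\mathrm{Cov}(R_v, R_w) =
O(N)$, so $\mathrm{Var}(Y_{S^*}(R)) = O(N^2)$. Hence, using Chebyshev's inequality we obtain

$$\begin{aligned}
Y_{S^*}(R) - \sqrt{|S^*|}\,\bar r &= \frac{|S|}{\sqrt{|S^*|}}\,(N - |S| - |S^* \setminus S|)(\bar p_S - 1/2) + O_P(N) \\
&\ge \rho(S, S^*)\,(N - o(N))\,\tau\sqrt{2\log N} + O_P(N) ,
\end{aligned} \qquad (19)$$

where we used the condition on $q_u$ to conclude that $|S| + |S^* \setminus S| = o(N)$. In summary we have

$$\mathrm{SCAN}(R) \ge c\,\frac{N}{2\sqrt{3}}\,\sqrt{2\log N} ,$$

with probability going to 1 as $N \to \infty$, where $c \in (1, 2\tau\sqrt{3})$.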

Plugging this back into (18) and accounting for the condition on $q_l$ we get

$$\log \mathfrak{P}(R) \le \log|\mathcal{S}_b| - \frac{c^2\,\frac{N^2}{6}\log N}{\frac{N^2}{6} + o(N^2)} \le \log|\mathcal{S}_b| - \frac{c^2 \log N}{1 + o(1)} ,$$

with probability going to 1. Noting that the upper bound on $|\mathcal{S}_b|$ in (17) still holds and that $c > 1$
allows us to conclude that $\log \mathfrak{P}(R) \to -\infty$ as $N \to \infty$, hence the test is asymptotically powerful.


7.3 Proof of Proposition 2 


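Showing this result amounts to relating $p_\theta$ with $\theta$. This is conveniently done by a Taylor
expansion around zero. When $F_0$ is discrete, we have

$$p_\theta = \int \Big(\bar F_\theta(x) + \tfrac{1}{2}\,f_\theta(x)\,F_0(\{x\})\Big)\,dF_0(x) .$$

We expand the integrand seen as a function of $\theta$ around $\theta = 0$ up to a second order error term. We
have

$$\frac{\partial}{\partial\theta} f_\theta(x)\Big|_{\theta = 0} = x , \qquad \frac{\partial}{\partial\theta} \bar F_\theta(x)\Big|_{\theta = 0} = \int_{(x, \infty)} u\,dF_0(u) ,$$

where the second identity comes from differentiating inside the integral defining $\bar F_\theta(x)$, justified by
dominated convergence. Note that $\frac{\partial^2}{\partial\theta^2} f_\theta(x)$ is integrable w.r.t. $F_0$ when $\theta \in [0, \theta^*)$ and the same
holds for $\frac{\partial^2}{\partial\theta^2} \bar F_\theta(x)$ as well. Hence let

$$\tilde c_0 := \int \sup_{\theta \in [0, \bar\theta]} \left|\frac{\partial^2}{\partial\vartheta^2} f_\vartheta(x)\Big|_{\vartheta = \theta}\right| dF_0(x) < \infty , \quad \text{and} \quad c_0 := \int \sup_{\theta \in [0, \bar\theta]} \left|\frac{\partial^2}{\partial\vartheta^2} \bar F_\vartheta(x)\Big|_{\vartheta = \theta}\right| dF_0(x) < \infty .$$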

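Therefore

$$\begin{aligned}
p_\theta &\ge \int \left[\bar F_0(x) + \tfrac{1}{2}F_0(\{x\}) + \theta\left(\int_{(x, \infty)} u\,dF_0(u) + \tfrac{1}{2}F_0(\{x\})\,x\right)\right] dF_0(x) - \frac{\theta^2}{2}\,(c_0 + \tilde c_0/2) \\
&= p_0 + \theta\Big(\mathbb{E}_0\big(X \mathbb{1}\{X > Y\}\big) + \tfrac{1}{2}\,\mathbb{E}_0\big(X \mathbb{1}\{X = Y\}\big)\Big) - \frac{\theta^2}{2}\,(c_0 + \tilde c_0/2) \\
&= \tfrac{1}{2} + \theta\,\Upsilon_0 - \frac{\theta^2}{2}\,(c_0 + \tilde c_0/2) .
\end{aligned}$$

When $F_0$ is continuous, we have

$$p_\theta = \int \bar F_\theta(x)\,dF_0(x) ,$$

and similar calculations lead to

$$p_\theta \ge \tfrac{1}{2} + \theta\,\Upsilon_0 - \frac{\theta^2}{2}\,c_0 .$$

In summary, we conclude that $p_\theta \ge \frac{1}{2} + \theta\,\Upsilon_0 + O(\theta^2)$ as $\theta \to 0$. In addition, note that $p_\theta$ is
monotonically increasing in $\theta$, by virtue of the fact that $(F_\theta : \theta \ge 0)$ has monotone likelihood ratio.
Therefore,

$$\frac{1}{|S|}\sum_{v \in S} p_{\theta_v} \ge \frac{1}{2} + \tau\,\Upsilon_0\sqrt{\frac{2\log N}{|S|}} + O\!\left(\frac{2\log N}{|S|}\right) .$$

Finally, using the above bound in (19) and proceeding in an analogous fashion as in Theorem 2
yields the desired result.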


7.4 Proof of Proposition 3 

We treat each case separately. 

Condition (i). The same arguments hold as before under the null, so again we are left with studying
the alternative. To deal with smaller intervals, we need a slightly different concentration inequality
than before.

Lemma 4 (Chernoff's inequality for ranks). In the context of Lemma 2, assume that $z_j = j$ for all
$j$. Then

$$\mathbb{P}\big(\bar Z \ge \bar z + t\big) \le \exp\Big(-m \sup_{\lambda > 0} \psi(t, \lambda)\Big) , \quad \forall t \ge 0 ,$$

where

$$\psi(t, \lambda) = \lambda t - \log\left(\frac{\sinh(\lambda J/2)}{J \sinh(\lambda/2)}\right) .$$

Similarly to Lemma 2 this result is also a consequence of Theorem 4 of Hoeffding (1963) and
Chernoff's bound. However, with the assumption on $z_j$ in the lemma above we can directly compute
the moment generating function of $Z_j$ after using Chernoff's bound instead of upper bounding it,
as is classically done to obtain Bernstein's inequality.

In the present context, this yields

$$\mathfrak{P}(r) \le |\mathcal{S}| \exp\Big(-k\,\psi\big(\mathrm{SCAN}(r)/\sqrt{k}, \lambda\big)\Big) , \quad \forall \lambda > 0 .$$

Note that $x \le \sinh(x) \le e^x/2$ and $|\mathcal{S}| \le N$, hence

$$\mathfrak{P}(r) \le N \exp\Big(-\lambda\sqrt{k}\,\mathrm{SCAN}(r) + \frac{\lambda k N}{2} - k\log(\lambda N)\Big) , \quad \forall \lambda > 0 . \qquad (20)$$

The next step is to remove the conditioning $R = r$ and bound $\mathrm{SCAN}(R)$. Recall $\mathrm{SCAN}(R) \ge
Y_S(R) - \sqrt{k}\,\frac{N + 1}{2}$ where $S$ is the anomalous interval. As in the proof of Theorem 2 we use Lemma 3
to evaluate the terms $\mathbb{E}(Y_S(R))$ and $\mathrm{Var}(Y_S(R))$. We have

$$\mathbb{E}\big(Y_S(R)\big) = \sqrt{k}\,(N - k)(\bar p_S - 1/2) + \sqrt{k}\,\frac{N + 1}{2} ,$$

where we use the shorthand notation $\bar p_S = \frac{1}{|S|}\sum_{v \in S} p_v$. For the variance term, recalling the definition
of $\lambda_v$ from Lemma 3, we note that $\lambda_v \le p_v$. Hence

$$\mathrm{Var}(R_v) = (\lambda_v - p_v^2)\,N^2 + O(kN) \le p_v(1 - p_v)\,N^2 + O(kN) \le (1 - p_v)\,N^2 + O(kN) .$$

Also using $\mathrm{Cov}(R_v, R_w) = O(N)$, we get

$$\mathrm{Var}\big(Y_S(R)\big) \le (1 - \bar p_S)\,N^2 + O(kN) .$$

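According to our assumption, there exists a sequence $\omega_N \to \infty$ such that $1 - \bar p_S \le \omega_N^{-1} N^{-2/k}$.
For reasons that become apparent at the end of the proof, we choose $\omega_N \to \infty$ not too fast (for
instance $\omega_N \le \log N$ suffices). Using Chebyshev's inequality we get

$$\begin{aligned}
&\mathbb{P}\Big(Y_S(R) - \sqrt{k}\,\tfrac{N + 1}{2} < \sqrt{k}\,(N - k)\big(\tfrac{1}{2} - \omega_N^{-1/2} N^{-1/k}\big)\Big) \\
&\quad= \mathbb{P}\Big(Y_S(R) - \mathbb{E}(Y_S(R)) < \sqrt{k}\,(N - k)\big(1 - \omega_N^{-1/2} N^{-1/k} - \bar p_S\big)\Big) \\
&\quad\le \mathbb{P}\Big(Y_S(R) - \mathbb{E}(Y_S(R)) < -\sqrt{k}\,(N - k)\big(\omega_N^{-1/2} N^{-1/k} - \omega_N^{-1} N^{-2/k}\big)\Big) \\
&\quad\le \mathbb{P}\Big(\big|Y_S(R) - \mathbb{E}(Y_S(R))\big| > \sqrt{k}\,(N - k)\big(\omega_N^{-1/2} N^{-1/k} - \omega_N^{-1} N^{-2/k}\big)\Big) \\
&\quad\le \frac{N^2 \omega_N^{-1} N^{-2/k} + O(kN)}{k (N - k)^2 \big(\omega_N^{-1/2} N^{-1/k} - \omega_N^{-1} N^{-2/k}\big)^2} \le \frac{4 N^2 \omega_N^{-1} N^{-2/k} + O(kN)}{k (N - k)^2\,\omega_N^{-1} N^{-2/k}} ,
\end{aligned}$$

where the last inequality follows because $\omega_N^{-1/2} N^{-1/k} \le 1/2$ eventually as $N \to \infty$. Hence,

$$\mathrm{SCAN}(R) \ge \sqrt{k}\,(N - k)\big(\tfrac{1}{2} - \omega_N^{-1/2} N^{-1/k}\big) ,$$

with probability converging to 1 as $N \to \infty$. Using this with (20) we get

$$\log \mathfrak{P}(R) \le \log N + \frac{\lambda k^2}{2} + \lambda k (N - k)\,\omega_N^{-1/2} N^{-1/k} - k\log(\lambda N) , \quad \forall \lambda > 0 ,$$

with probability tending to 1. Choosing $\lambda = \omega_N^{1/2} N^{1/k}/N$ we get

$$\log \mathfrak{P}(R) \le \frac{k^2}{2N}\,\omega_N^{1/2} N^{1/k} + k - \frac{k}{2}\log\omega_N ,$$

with probability going to 1, where we used that $\omega_N$ grows slowly enough for the first term to vanish.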

Condition (ii). We can mimic the arguments above. Suppose $k = c\log N$ with arbitrary $c > 0$ and
$\bar p_S = 1 - (1 - \delta)\exp\big(-\frac{c + 1}{c}\big) =: 1 - (1 - \delta) f(c)$ with some $\delta > 0$. As before, using Chebyshev's inequality
we can show that

$$\mathrm{SCAN}(R) \ge \sqrt{k}\,(N - k)\Big(\tfrac{1}{2} - \big(1 - \tfrac{\delta}{2}\big) f(c)\Big) ,$$

with probability tending to 1 as $N \to \infty$. Plugging this into (20), choosing $\lambda = 1/(N f(c))$ we get

$$\log \mathfrak{P}(R) \le \log N + \frac{k^2}{2 f(c) N} + \frac{k (N - k)\big(1 - \frac{\delta}{2}\big)}{N} + k\log f(c) ,$$

with probability going to 1 as $N \to \infty$. Plugging in $k = c\log N$ and $f(c) = \exp\big(-\frac{c + 1}{c}\big)$ we see that
the log of the p-value goes to $-\infty$, which is what we wanted to show.


7.5 Additional results 
7.5.1 Sketch proof of Lemma 3 

First, assume that there are no ties in the ranks, with probability one. Note that we can write

$$R_i = 1 + \sum_{j \ne i} \mathbb{1}\{Z_i > Z_j\} = 1 + \sum_{j \in [s],\, j \ne i} \mathbb{1}\{Z_i > Z_j\} + \sum_{j \notin [s],\, j \ne i} \mathbb{1}\{Z_i > Z_j\} .$$

Taking expectation yields

$$\mathbb{E}(R_i) = \begin{cases} 1 + (n - s)\,p_{i,0} + \sum_{j \in [s],\, j \ne i} p_{i,j} , & \text{when } i \in [s], \\[4pt] \dfrac{n + s + 1}{2} - \sum_{j \in [s]} p_{j,0} , & \text{when } i \notin [s], \end{cases}$$

since $\mathbb{P}(Z_i = Z_j) = 0$ for $i \ne j$ when there are no ties. The variance and covariance terms can be
worked out using the same representation of the ranks as above, but we omit these straightforward
computations for the sake of space.

In case of ties, to keep the presentation simple, assume that the distributions of $\{Z_i\}_{i \in [n]}$ are
supported on $\mathbb{Z}$. Then randomly breaking ties in the ranks amounts to using the following procedure.
Let $\{\epsilon_i\}_{i \in [n]}$ be independent and uniformly distributed on $(-c, c)$ with $c < 1/2$, also independent
from $\{Z_i\}_{i \in [n]}$. Consider $Z_i' = Z_i + \epsilon_i$, $i \in [n]$ and let $R_i'$ be the rank of $Z_i'$ in the combined sample
$\{Z_i'\}_{i \in [n]}$. Then the joint distribution of $\{R_i'\}_{i \in [n]}$ is the same as that of $\{R_i\}_{i \in [n]}$ when ties are
broken randomly.

For instance, for $i \notin [s]$

$$\begin{aligned}
\mathbb{E}(R_i') &= \frac{n + s + 1}{2} - \sum_{j \in [s]} \mathbb{P}(Z_j' > Z_i') \\
&= \frac{n + s + 1}{2} - \sum_{j \in [s]} \Big(\mathbb{P}(Z_j > Z_i) + \mathbb{P}(\epsilon_j > \epsilon_i \mid Z_i = Z_j)\,\mathbb{P}(Z_i = Z_j)\Big) \\
&= \frac{n + s + 1}{2} - \sum_{j \in [s]} p_{j,0} .
\end{aligned}$$

The rest of the claims can be worked out similarly.

Finally, when the $Z_i$ have arbitrary distributions a similar method can be applied, although it
requires a bit more care and one needs to take $c$ approaching zero.

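7.5.2 Derivation of $\Upsilon_0$ in the normal location model

Assume the normal model where $F_\theta = \mathcal{N}(\theta, 1)$. For this case we can simply compute $\Upsilon_0$. Since
there are no ties with probability 1, we have

$$\Upsilon_0 = \int_{-\infty}^{\infty} \int_x^{\infty} u\,f_0(u)\,du\,f_0(x)\,dx ,$$

where $f_0$ denotes the standard normal density. Considering the inner integral we have

$$\int_x^{\infty} u\,f_0(u)\,du = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} u\,e^{-u^2/2}\,du = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2} = f_0(x) .$$

Hence

$$\Upsilon_0 = \int_{-\infty}^{\infty} f_0(x)^2\,dx = \int_{-\infty}^{\infty} \frac{1}{2\pi}\,e^{-x^2}\,dx = \frac{1}{2\sqrt{\pi}} .$$

Therefore we conclude that $1/(2\sqrt{3}\,\Upsilon_0) = \sqrt{\pi/3}$.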
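The closed form above is easy to confirm numerically. The following sketch approximates $\int f_0(x)^2\,dx$ with the trapezoidal rule on a truncated range; the truncation to $[-10, 10]$ and the number of steps are arbitrary choices made for this check, and the function names are hypothetical.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def upsilon0_normal(lo=-10.0, hi=10.0, steps=20000):
    """Trapezoidal approximation of the integral of f_0(x)^2 over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (normal_pdf(lo) ** 2 + normal_pdf(hi) ** 2)
    for i in range(1, steps):
        total += normal_pdf(lo + i * h) ** 2
    return total * h
```

The result agrees with $1/(2\sqrt{\pi}) \approx 0.2821$, and plugging it into $1/(2\sqrt{3}\,\Upsilon_0)$ recovers $\sqrt{\pi/3} \approx 1.0233$.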